ABSTRACT


Automatic text document classification is a fundamental problem in machine learning. Given
the dynamic nature and the exponential growth of the World Wide Web, one needs the ability
to classify not only a massive number of documents, but also documents that belong to a wide
variety of domains. Examples of such domains are e-mails, blogs, Wikipedia articles, news
articles, newsgroups, and online chats. It is the difference in writing style that differentiates
these domains. Text documents are usually classified using supervised learning algorithms that
require a large set of pre-labeled data. This requirement for labeled data poses a challenge in
classifying documents that belong to different domains. Our goal is to classify text documents
in the testing domain without requiring any labeled documents from the same domain. Our
research develops specialized cross-domain learning algorithms based on the distributions over
words obtained from a collection of text documents by topic models such as Latent Dirichlet
Allocation (LDA). Our major contributions include (1) empirically showing that conventional
supervised learning algorithms fail to generalize their learned models across different domains
and (2) developing novel and specialized cross-domain classification algorithms that show
an appreciable improvement over conventional methods used for cross-domain classification,
an improvement that is consistent across different datasets. Our research addresses many
real-world needs. Since a massive number of new types of text documents are generated daily,
it is crucial to have the ability to transfer learned information from one domain to another.
Cross-domain classification lets us leverage information learned from one domain for use in
the classification of documents in a new domain.
CHAPTER 1:
Introduction 


Automatic text document classification is a fundamental problem in machine learning. Text
document classification is as old as writing itself; in fact, libraries dating back to 2000 BC in
Syria were archiving tens of thousands of clay tablets [1]. With the advent of paper making, a
new era of written documents started. Institutions such as the Library of Alexandria organized
hundreds of thousands of scrolls in their archives [2]. The next major explosion in text
documents came about with the invention of the printing press; text documents became easily
available and widely distributed. However, as the number of text documents increased, the need
for robust classification became more crucial. A few standards of text organization were
implemented for small and large libraries, such as the Dewey Decimal System in 1876 [3] and the
Library of Congress system in 1897 [4].

We have recently been experiencing a third, much larger, growth in the number of text documents
with the advent of the Internet and the development of the World Wide Web. It is currently
inconceivable to manually classify the large number of text documents that are generated daily.
Yahoo Inc. made some of the earliest attempts to neatly organize all the information available
on the World Wide Web in a directory structure. This directory filled a gap in the
organization of the Web and helped Yahoo become one of the first and most successful Internet
companies. However, Yahoo’s directory-based organization system soon became inadequate
and outdated due to the rapid, exponential growth in the number of text documents.
Therefore, new and more intelligent ways of searching and organizing text documents needed
to be introduced. Google Inc. arose as one of the most successful companies for searching
information in an ever-growing Web of text documents.

Given the dynamic nature of the Web, the solution to the problem of text document classification
needs to address the exponential growth of the Internet and the wide variety of text documents.
This wide variety is a result of documents being generated in various domains.
These domains can be e-mails, blogs, wiki articles, news articles, Twitter posts, message forums,
online chats, speech transcripts, etc. In achieving this goal, one of the most important challenges
is the problem of learning topics in text documents that belong to different domains. These
domains represent information in different ways, each serving a particular purpose. Often a
classification scheme that works well in one domain does not work as well in another. Given
the nature of current text document classification algorithms, we hypothesize that an algorithm
that is trained to classify e-mails well may not be able to classify news articles with the same or
similar accuracy. We confirm this hypothesis with our experimental results in this thesis. This
drop in accuracy when going from one domain to another is what we refer to as the cross-domain
classification problem. This dissertation focuses on the problem of cross-domain classification.
That is, we use classification information from one domain and apply that learned information
to classify text documents from a different domain.

For ease of understanding, we illustrate some of the terms used in this dissertation, such as
domain, topic and category, with an example involving Wikipedia and the New York Times (NYT)
newspaper. The articles written on Wikipedia covering different topics constitute one domain,
and the articles written in the NYT by professional journalists constitute another
domain. These two domains may share subject matter, e.g., both may contain articles on
political elections. However, each domain may also have its own style of writing while covering
a particular topic. The difference in writing style is due to differences in the editing
process, the caliber of the writers, the purpose of the articles, etc. It is this difference in
writing style that differentiates two domains such as Wikipedia and NYT. In our context, topic could
have two meanings. It could describe the category of the document or describe an output of a
topic model. We will generally use the term category for the former. So an article discussing
the culinary arts of Japan will belong to the cooking category and the Japan category. Two different
domains, e.g., Wikipedia and NYT, may both have articles in the category cooking.

Text documents can be classified by using either supervised learning or unsupervised learning.
A supervised learning method classifies documents by using a set of pre-labeled data. A third
class of learning algorithm, referred to as semi-supervised learning, is often used when there
is a lack of labeled data. Semi-supervised learning leverages both supervised and unsupervised
learning techniques. Each class of algorithms has its own advantages and disadvantages.
In the following two paragraphs, we introduce supervised and unsupervised learning
algorithms in more detail.

In supervised learning, an algorithm is presented with a set of correctly “labeled” data that
it can use to learn the distribution of the data into different classes. Supervised learning is
called “supervised” because the algorithm is given external input, a supervision, about what
is wrong and what is right, i.e., it is being supervised by the labeled data. Consider training
an algorithm to recognize human faces in images. With supervised learning, the algorithm
will be provided with multiple images and informed whether each image is a face or not. The
supervised algorithm then learns the differentiating characteristics between face and non-face
images. Once these characteristics are learned, the algorithm can be used to classify new and
unlabeled images into face and non-face categories. Common supervised learning algorithms
include artificial neural networks (ANN) [5], support vector machines (SVMs) [6], k-nearest
neighbors (KNN) [7] and naive Bayes (NB) [8].

In unsupervised learning, labeled data is not available. Instead of putting items into fixed
categories, the algorithm tries to discover patterns or clusters. Using the previous example, the
algorithm is given the set of images of faces and non-faces but without the information of which
image is a face and which is not. Ideally, the algorithm will find two distinct clusters: a cluster
containing all the face images and another cluster with all the non-face images. However, it is
more likely that the algorithm will end up finding many different clusters, each capturing some
aspect of an image. As one can see, the outcome of an unsupervised algorithm can be unpredictable;
therefore, it can reveal patterns that were previously unknown. For example, an unsupervised
learning algorithm can reveal patterns in the gene expression of subjects with a certain genetic
disorder that could not be detected by manual inspection of the data. These algorithms
are also often used for applications such as e-discovery, which is used to locate electronic data
for legal cases. Common unsupervised algorithms include k-means [9], the expectation-maximization
(EM) algorithm for mixtures of Gaussians [10], and the latent Dirichlet allocation (LDA) topic model
[11]. The outcome of an unsupervised algorithm is often a probability distribution over the given
data points. In our example, each image, face or not face, will constitute a data point in this
distribution. The models learned by probabilistic unsupervised algorithms are also called generative
models because they provide a method to generate new data points by revealing the underlying
graphical model. However, it is important to note that not all generative models are learned
by probabilistic unsupervised methods.

Both supervised and unsupervised algorithms present their own advantages and disadvantages
depending on the particular application. As with the image classification example, the unsupervised
learning algorithm, in most cases, will not detect patterns in the data according to the perceived
labels assigned by humans. Since there is no external input directing the unsupervised algorithm
to learn the differences between pre-determined classes, unsupervised algorithms often learn
other subtle patterns that may or may not be useful. In summary, unsupervised learning
algorithms can reveal interesting and unexpected patterns. Supervised learning algorithms can learn
the differences between pre-determined classes; however, they present their own disadvantages
by requiring a large number of labeled data points. In many cases, obtaining a large dataset that
has been carefully labeled is not feasible. A supervised learning algorithm forces itself to learn
any pattern given and generalizes it to classify new data. If the data was not carefully labeled
or the number of labeled data points was very small, the performance of the supervised learning
algorithm on new and unlabeled data is often very poor [12], [13]. In summary, supervised
learning algorithms offer a way to classify information into pre-determined categories given a
large number of carefully labeled data points.

We  will  illustrate  the  use  of  learning  algorithms  for  our  cross-domain  classification  task  with  an 
example.  Suppose  that  we  want  to  classify  articles  from  NYT  into  one  of  the  pre-determined 
categories.  In  this  case,  NYT  will  be  our  testing  domain.  In  order  to  classify  NYT  articles,  we 
will  use  labeled  articles  from  Wikipedia  to  learn  a  classifying  model.  In  this  case,  Wikipedia 
will  be  our  training  domain.  A  classifying  model  is  a  function  that  takes  an  unlabeled  article 
as  its  input  and  returns  a  category  label  as  the  output. 
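
The setup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two tiny document collections are invented, and a naive Bayes classifier with Laplace smoothing stands in for whatever supervised learner is actually used. The function returned by the hypothetical `train_naive_bayes` plays the role of the classifying model: it takes an unlabeled article and returns a category label.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Learn a classifying model from (text, category) pairs of the training domain."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of documents
    vocabulary = set()
    for text, category in labeled_docs:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    total_docs = sum(doc_counts.values())

    def classify(text):
        """The classifying model: unlabeled article in, category label out."""
        best_category, best_score = None, -math.inf
        for category in doc_counts:
            total_words = sum(word_counts[category].values())
            score = math.log(doc_counts[category] / total_docs)
            for word in text.lower().split():
                # Laplace smoothing keeps unseen words from zeroing the score.
                count = word_counts[category][word] + 1
                score += math.log(count / (total_words + len(vocabulary)))
            if score > best_score:
                best_category, best_score = category, score
        return best_category

    return classify


# Invented toy data: a Wikipedia-style training domain and an NYT-style testing domain.
wikipedia_labeled = [
    ("sushi is a japanese dish of rice and fish", "cooking"),
    ("an election is a formal vote to choose a candidate", "politics"),
]
nyt_unlabeled = [
    "voters choose a candidate in the election",
    "the chef served rice and fish",
]

model = train_naive_bayes(wikipedia_labeled)
predictions = [model(article) for article in nyt_unlabeled]
```

No labeled NYT article is used at any point; the model sees NYT text only at prediction time, which is the defining property of the cross-domain setting.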

Text document classification often requires one to classify documents into specific, pre-determined
categories. At the same time, we lack the ability to generate large amounts of labeled data for
documents, especially for new topics and new domains. The desire to classify documents into
pre-determined categories poses problems for unsupervised learning, while the lack of labeled
data is problematic for supervised learning. It is sometimes the case, as shown in this
dissertation, that supervised algorithms taught to classify documents in one domain cannot classify
similar documents in different domains. We show that when a conventional supervised
algorithm, e.g., SVM, is trained to classify news articles from the New York Times (NYT) newspaper,
it classifies NYT articles with good accuracy. However, when that learned classifier is used
to classify Wikipedia articles, the accuracy of classification drops significantly. In order for a
supervised algorithm to work well for multiple domains, it has to be trained again and again
for each new domain that will be classified. Each time it needs a new labeled set of documents
from the training domain. Our algorithms, developed as part of this research, are able to learn
the classifying model from one domain (training domain) and use it on a different domain
(testing domain) with appreciable improvement in accuracy when compared with conventional
algorithms.

Our goal is to classify text documents in a new (testing) domain by using a labeled set of
documents from a different (training) domain. In other words, we want to be able to classify text
documents in a new domain without requiring any set of labeled documents from this domain.
This is what we refer to as cross-domain classification. Our research develops specialized
cross-domain learning algorithms to accomplish this goal. Major contributions of this
dissertation include:


• Definition of the cross-domain problem and a new framework to tackle it. We
also present our framework’s relation to previous work in the related field;

• Evidence that cross-domain classification is a more challenging problem
than single-domain classification. We present empirical results showing a
significant drop in classification accuracy when different domains are used for training and
testing;

• Analysis of the cross-domain classification problem, tabulating the reasons for this
considerable drop in classification accuracy;

• Novel and specialized cross-domain classification algorithms that show an appreciable
improvement over conventional methods used for the same task, an improvement that is
consistent across different datasets;

• Development and presentation of a new dataset that can be useful for many researchers
working in the field of cross-domain classification.


To understand the challenges of the cross-domain classification task, we start by performing
experiments using Wikipedia as our only training domain. This research gives us considerable
insight into the cross-domain classification task. While working with Wikipedia, we only
develop algorithms that specifically use Wikipedia as the training set to classify text documents
in any other domain. These initial algorithms are customized for Wikipedia, and they utilize the
unique properties of Wikipedia articles such as the presence of external and internal links. We
learn from these experiments and then extend our work to be more general; that is, our work
consists of a general set of algorithms that are not tied to any particular training domain.

In our algorithms, we utilize both supervised and unsupervised algorithms to accomplish the
cross-domain classification task. We use some of the common supervised algorithms such as
k-nearest neighbors, support vector machines and naive Bayes. The unsupervised algorithms used
in our methods are primarily based on topic models such as Latent Dirichlet Allocation (LDA)
and probabilistic Latent Semantic Analysis (pLSA). We go over two different ways of using
LDA for classification: (1) using the vectors describing the distribution of topics in documents
(γ vectors) and (2) using the topics as distributions over words (β vectors). We describe the methods
and distance metrics for our methods and give empirical results.
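
To make the idea of comparing documents through their topic vectors concrete, the following sketch labels a test document with the category of the nearest training document. It rests on several assumptions that are not taken from this dissertation: the topic proportions are invented and already normalized, the Jensen-Shannon divergence is chosen purely as an example of a distance measure between distributions, and `nearest_label` is a hypothetical helper name.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and finite, unlike plain KL."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)


def nearest_label(test_vector, labeled_vectors):
    """Return the label of the training vector closest to test_vector."""
    closest = min(labeled_vectors, key=lambda pair: js_divergence(test_vector, pair[0]))
    return closest[1]


# Invented topic proportions over three topics (each vector sums to 1).
training = [
    ([0.8, 0.1, 0.1], "cooking"),
    ([0.1, 0.1, 0.8], "politics"),
]
test_doc = [0.2, 0.1, 0.7]
predicted = nearest_label(test_doc, training)
```

A symmetric measure is used here so that the result does not depend on which of the two documents is treated as the reference; other measures between probability distributions could be substituted in `nearest_label` without changing the rest of the sketch.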

Topic models are generative models for a collection of text documents. They assume that
documents may contain a mixture of topics, where topics are defined as distributions over words.
Many topic models have been developed to model text document collections, each making
different assumptions about the relationships between topics, words, documents and their
underlying distributions. Most topic models operate in an unsupervised learning framework, i.e., they
look to extract the generative distribution of the collection instead of focusing on dividing
documents into distinct classes. There are, however, several variations of topic models that try to
incorporate external labels provided to the algorithm in shaping the underlying distribution.
In this way, these algorithms work in a more semi-supervised manner. Many such variations
have been developed or modified for different purposes. The algorithms developed
in this dissertation do not introduce new generative models. Instead, they concentrate on
utilizing the information extracted from a topic model in a way that is useful for cross-domain
classification. We provide both theoretical and empirical justification for the robustness of our
algorithms.
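
The generative view of a topic model can be illustrated with a short simulation. The sketch below follows the usual LDA-style story (draw topic proportions for a document, then for each word draw a topic and a word from that topic). The two topics, the Dirichlet parameter `alpha` and the function names are all invented for illustration and are not the models or settings used in this dissertation.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate words: pick a topic for each word, then a word from that topic."""
    vocabulary = sorted(topics[0])
    topic_mixture = sample_dirichlet(alpha, rng)  # per-document topic proportions
    words = []
    for _ in range(length):
        k = rng.choices(range(len(topics)), weights=topic_mixture)[0]
        word_weights = [topics[k][w] for w in vocabulary]
        words.append(rng.choices(vocabulary, weights=word_weights)[0])
    return topic_mixture, words


# Two invented topics, each a distribution over the same four-word vocabulary.
topics = [
    {"rice": 0.5, "fish": 0.4, "vote": 0.05, "senate": 0.05},
    {"rice": 0.05, "fish": 0.05, "vote": 0.5, "senate": 0.4},
]
rng = random.Random(0)
mixture, words = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
```

Fitting a topic model runs this story in reverse: given only the words, it estimates the topics and the per-document mixtures that plausibly generated them.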

For developing and verifying the effectiveness of our cross-domain classification algorithms,
we use text documents from three different domains for our experiments: (1) Wikipedia, (2)
the New York Times newspaper and (3) newsgroups. All three domains are used as training and
testing domains at different times to show the consistency and robustness of our algorithms in
cross-domain classification.

Cross-domain classification has many real-world applications. Given the dynamic nature of the
Web, a massive number of text documents, belonging to a new and wide variety of domains, are
generated daily. In this environment, it is essential to be able to transfer learned information
from one domain to another. If one wants to classify documents consisting of text
chat transcripts, conventional algorithms will require a large number of labeled chat transcripts
before being able to classify them. Such labeled data may be very difficult to obtain or may not
exist at all. Our algorithms, on the other hand, will be able to use the information learned
from a different domain, such as Wikipedia, and will be able to classify the chat transcripts
without requiring any labeled chat data. In short, cross-domain classification lets us leverage
information learned from one domain for use in the classification of documents in a new or
different domain.


This thesis is organized as follows: Chapter 2 discusses the related work in the field of
cross-domain classification. Chapter 3 outlines some machine learning algorithms utilized in this
research. Chapter 4 gives the details of the data sets used in our research. Chapter 5 presents
our experiments with Wikipedia as the chosen training domain. It concentrates on techniques
for parsing and using different sections of the Wikipedia articles to learn a document classifier.
Chapter 6 presents the empirical results of cross-domain classification using conventional
methods. It shows a considerable drop in accuracy when different domains are used for training
and testing. Chapter 7 describes our algorithms for cross-domain classification along with the
experimental results and the analysis of these results. We run Wilcoxon rank-sum statistical
tests to show the statistical significance of the results presented in Chapter 7. The outcomes of
these tests are shown in Appendix A of this dissertation. We conclude with future work in the
area of cross-domain classification in Chapter 8.
CHAPTER 2:
Related  Work 


Our research touches on two main research areas: machine learning and text document
classification. Our goal is to accomplish cross-domain text document classification using machine
learning algorithms. This goal leads us to explore a wide array of research areas including
experiments with Wikipedia data, theoretical distance measures of probability distributions and
various topic models for text documents. We gained substantial insights from the related work
discussed in this chapter, which includes: (1) classification
of text documents using Wikipedia, (2) topic models for supervised learning, (3) using
unsupervised topic models for classification, (4) methods for cross-domain knowledge transfer
and (5) topic models for cross-domain classification. Some of the common machine learning
algorithms such as Latent Dirichlet Allocation (LDA), Support Vector Machines (SVMs), and
Naive Bayes (NB) will be introduced in Chapter 3.

2.1  Classification  of  Text  Documents  Using  Wikipedia 

Wikipedia, in the last few years, has emerged as an invaluable resource of text documents.
Wikipedia offers a thorough organizational structure of its articles in various categories. Each
article belongs to one or more categories, although there may be articles without a category.
Each article is itself organized into different sections such as introduction, references,
notes, etc. Wikipedia has been analyzed in various publications [14, 15, 16, 17, 18]. While
some of these papers have concentrated on the evolution of Wikipedia and social aspects of
the Wikipedia community [17, 18], we are more interested in researching the organizational
structure of its articles. Given the structure and the size of Wikipedia, there is also a tendency
to use it for text document classification. There are many different ways to use Wikipedia in
aid of text document classification. One of the prominent ways
is using Wikipedia as a source of semantic background knowledge. As a semantic background
knowledge source, Wikipedia is used to augment other text data [15, 19]. Our goal is to exploit
the organizational structure of Wikipedia and its articles for text document classification. The
following paragraphs introduce these related works in more detail.

Wang et al. attempted to improve the accuracy of text document classification by augmenting
their training data with Wikipedia content [19]. The authors first created the feature vectors,
which are the word counts in the documents using the bag-of-words model. The authors then
incorporated word relations (synonymy, polysemy, and hyponymy) into this feature vector. This
is done by simply adding the counts of the related words in the feature vector. However, the
authors’ use of Wikipedia is limited, since their approach concentrated only on extracting the three
different types of relationships among words. They extracted the word relations using the
hyperlinks between the Wikipedia articles and did not use any other part of the Wikipedia data.
The authors have also published a similar paper presenting a way of using Wikipedia to develop
a thesaurus [20].
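
One plausible reading of this augmentation step can be sketched as follows. This is not the code of Wang et al.; the relations table, the example sentence and the helper names are invented, and the sketch assumes that each word's count is simply added to the feature-vector entries of its related words.

```python
from collections import Counter


def bag_of_words(text):
    """Build a word-count feature vector from raw text."""
    return Counter(text.lower().split())


def augment_with_relations(counts, related_words):
    """Add each word's count to the entries of its related words."""
    augmented = Counter(counts)
    for word, count in counts.items():
        for related in related_words.get(word, ()):
            augmented[related] += count
    return augmented


# Invented relations table, standing in for relations mined from Wikipedia hyperlinks.
related_words = {"car": ["automobile", "vehicle"]}
features = augment_with_relations(
    bag_of_words("the car passed another car"), related_words
)
```

After augmentation, a document that only mentions "car" also carries weight on "automobile" and "vehicle", so it can match documents from another source that prefer those words.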

Gabrilovich et al. took an approach that is more elaborate in its use of Wikipedia text [21].
The authors’ approach not only concentrated on explicitly defined lexical relationships among
words, but also used the document text itself to augment the feature vectors. In their
experiments, they showed a modest but consistent improvement in the classification accuracy on
various datasets including 20-newsgroups [22], where the accuracy increased from 85.4% to
86.2%.

Schonhofen presented a way of using Wikipedia categories in the classification of text
documents [23]. The author did not utilize the Wikipedia article text in his classification algorithm;
he only used the category labels and the titles of the articles. He experimented by classifying
Wikipedia articles as well as the 20-newsgroups articles. His 20-newsgroups classification did
show a modest improvement when performed using the Wikipedia information.

Wang et al. improved on the results of a co-clustering algorithm [24] by using Wikipedia text
[25] to augment their training data. The authors also used the 20-newsgroups data and split this data
into two domains. They used some of the newsgroups as the first domain and some other similar
newsgroups to represent the corresponding second domain. As an example from the research
article, the first domain contained newsgroups such as recreation.autos and recreation.baseball,
while the second domain contained recreation.motorcycles and recreation.hockey.

Medelyan et al. have written a survey article of research on Wikipedia for text mining purposes
[26]. The authors focused this survey on the research that extracts and makes use of the
concepts, relations, facts and descriptions found in Wikipedia. They organized the work into four
broad categories: applying Wikipedia to natural language processing, using it to facilitate
information retrieval, using it for information extraction, and using it as a resource for ontology
building. This survey is a great resource for researchers who are interested in using Wikipedia
for text mining.

Using data from Wikipedia to enhance already labeled text data has been researched by many,
as we have shown in this section. However, using different sections within a Wikipedia article
for the purpose of text mining has not been given as much attention. In our research we fill this
gap in the literature. In our Wikipedia research, we parse and analyze different parts of a typical
Wikipedia article, which we refer to as sections of the article. Chapter 5 discusses the properties
of these individual sections. We present our experimental results in cross-domain classification,
where Wikipedia text is used to enhance the classification of documents in a different domain.

2.2  Topic  Models  for  Supervised  Learning 

Topic  models  are  generally  unsupervised  probabilistic  generative  models  based  on  a  Bayesian 
network  and  are  often  used  for  modeling  a  document  collection.  Different  topic  models  are 
developed  by  modifying  and/or  extending  the  underlying  Bayesian  networks.  For  example,  the 
LDA  model  extends  the  probabilistic  Latent  Semantic  Analysis  (pLSA)  model  by  introducing 
Dirichlet  priors  over  the  topic  distributions.  As  one  can  imagine,  many  topic  models  have  been 
developed  over  the  years  by  modifying  other  topic  models  in  different  ways.  Some  of  these 
modifications  are  done  to  develop  topic  models  for  supervised  learning.  However,  a  common 
theme  is  shared  by  all  topic  models.  Heinrich  presented  the  underlying  commonality  of  topic 
models  elegantly  in  his  article  titled  “A  Generic  Approach  to  Topic  Models”  [27]. 

Even  though  most  topic  models  remain  unsupervised,  in  this  section,  we  will  discuss  some 
of  the  topic  models  that  use  the  document  labels.  The  label  of  a  document  is  inserted  in  the 
Bayesian network as an observed node and influences the probabilities and distribution of the
words  in  topics.  This  is  the  case  with  the  models  in  [28,  29,  30,  31,  32,  33]  and  many  others. 
Some  of  these  papers  such  as  [32,  33]  do  not  address  the  text  document  classification  problem 
in  general  but  use  the  model  to  tackle  problems  of  segmentation  and  recognition  of  the  objects 
in  images  and  video. 

Andrzejewski et al. incorporated supervision into the LDA model [31]. This is done by
restricting a set of words from a given set of topics. For example, the topic set {1, 2} may not
contain the set of words {british, uk, wales, london}. The authors named this model
LDA-topic-in-subset (LDA-TIS). More specifically, LDA-TIS constrains a set of words to be
assigned only to a specific subset of topics. The authors have also published a previous paper on
finding bugs in programming code, where they partition the topics into two sets: (1) usage topics,
which can appear in all documents, and (2) bug topics, which can only appear in a special subset of
documents [34].

The authors added the TIS information into LDA via a modification of the Gibbs sampling
equations, which are shown below:


\[
\frac{\sum_{d} \sum_{n} \mathbb{1}(w_n = t)\, \mathbb{1}(z_n = k) + \beta}
     {\sum_{d,n,t} \mathbb{1}(w_n = t)\, \mathbb{1}(z_n = k) + V\beta}
\tag{2.1}
\]


and then defining p as follows by adding a parameter $\eta$ to control the strength of the
constraint:

\[
p(z_n = k \mid Z_{-n}, W) \propto \left( \eta\, \delta(k \in C^{(n)}) + (1 - \eta) \right)
\tag{2.2}
\]

where $C^{(n)}$ is the set of possible values for the latent topic $z_n$ for word $w_n$. The
indicator function $\delta(k \in C^{(n)})$ takes a value of 1 if $k \in C^{(n)}$ and 0 otherwise.
For example, if we wish to restrict $z_n$ to a single value, this can be accomplished by setting
$C^{(n)}$ to a single topic, e.g., {5}. Likewise, we can set $C^{(n)}$ to a subset of topics,
e.g., {1, 2, 3}, or to the whole set {1, 2, ..., K}, in which case the modified sampling reduces
to the standard collapsed Gibbs sampling. The formulation above gives a flexible way to insert
prior knowledge into the inference of latent topics. We can set $C^{(n)}$ individually for every
word $w_n$ in the corpus. This allows us to force two occurrences of the same word in a document
to be explained by different topics, an effect that would be impossible to achieve using standard
LDA. The authors performed two experiments, which are explained in more detail in the following
paragraphs.
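The effect of the constraint factor in Equation 2.2 can be illustrated with a minimal sketch.
The code below assumes that the unnormalized standard Gibbs weights for one word are already
available as a list; the function names tis_weights and sample_topic and the example numbers are
ours, chosen for illustration, and are not taken from [31].

```python
import random


def tis_weights(base_weights, allowed_topics, eta):
    """Scale unnormalized Gibbs weights by eta * delta(k in C) + (1 - eta).

    base_weights   -- unnormalized standard Gibbs weights, one per topic (assumed given)
    allowed_topics -- the set C of topics permitted for this word
    eta            -- constraint strength in [0, 1]
    """
    scaled = []
    for k, q in enumerate(base_weights):
        indicator = 1.0 if k in allowed_topics else 0.0
        scaled.append(q * (eta * indicator + (1.0 - eta)))
    total = sum(scaled)  # zero only if eta == 1 and no allowed topic has weight
    return [s / total for s in scaled]


def sample_topic(probs, rng):
    """Draw one topic index from the normalized probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


base = [0.2, 0.5, 0.3]
hard = tis_weights(base, {0}, eta=1.0)        # word forced into topic 0
soft = tis_weights(base, {0}, eta=0.5)        # topic 0 preferred, others still possible
free = tis_weights(base, {0, 1, 2}, eta=1.0)  # C is the whole set: standard sampling
topic = sample_topic(soft, random.Random(0))
```

With eta = 1 the constraint is hard, with eta = 0 it disappears, and intermediate values only
shift probability mass towards the allowed topics.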

The  first  experiment  is  performed  using  a  corpus  of  9000  yeast-related  abstracts.  The  authors 
restricted  the  topics  for  a  set  of  seed  words,  translation,  trna,  anticodon,  ribosome,  to  be  topic 
0,  for  all  occurrences  of  the  seed  words.  Their  goal  was  to  discover  if  LDA-TIS  can  guide  the 
topic  discovery  to  be  more  related  to  the  user-seeded  concepts.  The  authors  compared  the  topics 
learned  from  standard  LDA  with  LDA-TIS  and  found  that  LDA-TIS  is  able  to  group  all  terms 
relevant  to  the  seeded  words  to  topic  0,  while  standard  LDA  ends  up  splitting  the  relevant  terms 
between  three  different  topics. 

Their second experiment was similar to the first, but the authors performed it using the Reuters
newswire corpus [35]. The difference between the two experiments was that the seeded terms
{britain, british, uk, u.k., wales, scotland, london} were now allowed to be in a subset of
topics (topics 1, 2, and 3) instead of being restricted to only one topic. In addition, all the
other location-related terms were not allowed to be in the first three topics. LDA-TIS not only
uncovered the first three topics as related to the location “United Kingdom”, but also split them
cleanly into business (topic 1), cricket (topic 2), and soccer (topic 3). In contrast, the
standard LDA model was not able to discover such interesting topic patterns. Even though LDA-TIS
has promising results for this particular application, it could not repeat the same success with
other location-related seed terms such as China, United States, and Germany. In summary, LDA-TIS
cannot be very successful if the users’ target concepts are not prevalent in the text corpus.

Blei et al. have also proposed a supervised modification of their LDA model, called Supervised
LDA (sLDA) [28]. The authors made a small modification to the LDA model by adding a response
variable that depends on the document and the topic distribution of that document. The sLDA model
can be written as


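\[
p(x_{1:N}, y \mid \alpha, \beta_{1:k}, \eta, \sigma^2)
= \int p(\pi \mid \alpha) \sum_{z_{1:N}} \left( \prod_{n=1}^{N} p(z_n \mid \pi)\,
  p(x_n \mid z_n, \beta_{1:k}) \right) p(y \mid z_{1:N}, \eta, \sigma^2)\, d\pi
\tag{2.3}
\]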


where the variables in Equation 2.3 are taken from the LDA model as described in the Background
chapter. Under sLDA, one generates the documents the same way as one would under standard LDA.
Once the topic assignments ($z_n$) are generated, one draws a response variable $y$. $y$ is a
real-valued random variable with $y \mid z_{1:N}, \eta, \sigma^2 \sim N(\eta^{\top} \bar{z},
\sigma^2)$, where $\bar{z} := (1/N) \sum_{n=1}^{N} z_n$ and $\eta$ is a learned parameter for the
response variable.
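The response step of this generative process can be sketched as follows. The sketch assumes the
topic assignments are given as integer topic indices, which is equivalent to the unit-vector form
of $z_n$; the function names and the values of eta and sigma are hypothetical and only serve the
illustration.

```python
import random


def mean_topic_vector(assignments, num_topics):
    """Return z_bar, the empirical topic frequencies of one document."""
    counts = [0.0] * num_topics
    for k in assignments:
        counts[k] += 1.0
    n = len(assignments)
    return [c / n for c in counts]


def draw_response(assignments, eta, sigma, rng):
    """Draw y ~ N(eta^T z_bar, sigma^2) for one document."""
    z_bar = mean_topic_vector(assignments, len(eta))
    mean = sum(e * z for e, z in zip(eta, z_bar))
    return rng.gauss(mean, sigma)


z = [0, 0, 1, 2]                  # topic index of each of the N = 4 words
z_bar = mean_topic_vector(z, 3)   # empirical topic frequencies of the document
y = draw_response(z, eta=[1.0, -2.0, 4.0], sigma=0.5, rng=random.Random(0))
```

Note that the response depends on the document only through $\bar{z}$, so two documents with the
same empirical topic frequencies share the same response distribution.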

Their experiments used the star ratings from movie review data [36] and the number of votes each
link received in Digg.com's data [28] as the response variables. Since the distribution of the
response variable in the experimental datasets was not normal, they attempted to achieve
normality by taking the log of these numbers. They evaluated the performance of the model by
computing $R^2$, which they called predictive $R^2$, defined as
$pR^2 := 1 - \left( \sum (y - \hat{y})^2 \right) / \left( \sum (y - \bar{y})^2 \right)$. They
compared their $pR^2$ results with those of L1-regularized least-squares regression and showed an
8% to 9% improvement. As can be seen, sLDA proposes a rather simple, normally distributed,
continuous response variable that may not be suitable for many real-world classification tasks.
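The predictive $R^2$ defined above is straightforward to compute. The following is a minimal
sketch; the function name predictive_r2 and the toy numbers are ours and do not come from [28].

```python
def predictive_r2(y_true, y_pred):
    """Compute pR^2 = 1 - sum((y - y_hat)^2) / sum((y - y_bar)^2)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


ratings = [1.0, 2.0, 3.0, 4.0]
perfect = predictive_r2(ratings, ratings)      # predictions equal the responses
baseline = predictive_r2(ratings, [2.5] * 4)   # always predicting the mean response
```

A model that always predicts the mean response scores zero, so a positive value means the
predictions explain some of the variance in the held-out responses.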


Shan et al. proposed some changes to the sLDA model, primarily in the distribution of the
response variable [30]. The authors developed two new models, one by modifying the response
variable in sLDA and another by modifying the response variable in their own topic model, the
Latent Dirichlet conditional naive Bayes model [37]. The authors made the response variable more
suitable for classification labels, which are rarely distributed normally and are almost never
continuous or real valued. The modification to LDA with their response variable was called
discriminative-LDA (DLDA).

Here we will briefly discuss their model and compare it with sLDA, which was introduced in the
previous paragraphs. The DLDA model is written as below.


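\[
p(x_{1:N}, y \mid \alpha, \beta_{1:k}, \eta_{1:c-1})
= \int p(\pi \mid \alpha) \sum_{z_{1:N}} \left( \prod_{n=1}^{N} p(z_n \mid \pi)\,
  p(x_n \mid z_n, \beta_{1:k}) \right) p(y \mid z_{1:N}, \eta_{1:c-1})\, d\pi
\tag{2.4}
\]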


where the variables in Equation 2.4 are taken from the LDA model as described in the Background
chapter. The only difference from sLDA is that in sLDA (Eq. 2.3) the response variable $y$
depends on $\eta$, $\sigma^2$, and $\bar{z}$, whereas the response variable $y$ in their model is
given by a multi-class logistic regression as shown below.


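\[
y \sim \mathrm{LR}\left( \frac{\exp(\eta_h^{\top} \bar{z})}
                              {1 + \sum_{h=1}^{c-1} \exp(\eta_h^{\top} \bar{z})} \right)
\tag{2.5}
\]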


where $\bar{z}$ is the average of $z_{1:N}$ over all observed words, and each $z_n$ is a
$k$-dimensional unit vector whose $i$th entry is 1 if it denotes the $i$th component. The
categorical response variable $y$ can be considered as a sample generated from the Discrete
distribution $\left( p_1, \ldots, p_{c-1}, 1 - \sum_{h=1}^{c-1} p_h \right)$, where
$p_h = \frac{\exp(\eta_h^{\top} \bar{z})}{1 + \sum_{h=1}^{c-1} \exp(\eta_h^{\top} \bar{z})}$.
In two-class classification, $y$ is 0 or 1, generated from
Bernoulli$\left( \frac{1}{1 + \exp(-\eta^{\top} \bar{z})} \right)$, i.e. the model needs only one
$\eta$ in the two-class case.
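The class probabilities of the logistic response can be computed directly from Equation 2.5. The
sketch below assumes that the $c - 1$ weight vectors and $\bar{z}$ are given as plain lists; the
function name class_probabilities and the example values are hypothetical.

```python
import math


def class_probabilities(etas, z_bar):
    """Return the c class probabilities given c - 1 weight vectors eta_h.

    The first c - 1 entries are exp(eta_h^T z_bar) / (1 + sum_h exp(eta_h^T z_bar));
    the last entry is the remaining probability mass.
    """
    scores = [math.exp(sum(e * z for e, z in zip(eta, z_bar))) for eta in etas]
    denom = 1.0 + sum(scores)
    probs = [s / denom for s in scores]
    probs.append(1.0 - sum(probs))
    return probs


three_class = class_probabilities([[0.0, 0.0], [0.0, 0.0]], [0.5, 0.5])
two_class = class_probabilities([[2.0, 0.0]], [1.0, 0.0])
```

In the two-class call only one weight vector is passed, and the first returned probability equals
the Bernoulli parameter $1 / (1 + \exp(-\eta^{\top} \bar{z}))$ given in the text.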

The authors illustrated the versatility of their response variable by assigning labels in
different types of classification tasks. They used nine different datasets from the UCI datasets
and showed that their results were competitive with conventional classification algorithms such
as naive Bayes and SVMs, although in most cases their models did not perform as well as those
algorithms.


2.3  Using  Unsupervised  Topic  Models  for  Classification 

This section differs from the previous section, titled “Topic Models for Supervised Learning.”
This section explores innovative ways of using unsupervised topic models, such as LDA and pLSA,
for classification purposes. The previous section, on the other hand, explored new topic models
that themselves learn in a supervised setting. Surprisingly, using the output of an unsupervised
topic model, such as LDA, for classification has not received as much attention as the
development of new supervised topic models. Our research in this dissertation specifically uses
LDA and pLSA for classification of text documents across domains. We show encouraging
experimental results that strengthen the idea that research on using unsupervised topic models
for classification deserves more attention, and not just research on new topic models. This
section presents some papers that use LDA, which is an unsupervised topic model, for
classification.

Blei et al., while introducing the Latent Dirichlet Allocation (LDA) model, showed the
effectiveness of their model by comparing a perplexity measure against pLSA [11]. Perplexity
decreases as the likelihood assigned to previously unseen documents increases, so a lower
perplexity indicates a better model. The exact formulation of the perplexity measure for $M$
documents from the test set $D_{test}$ is shown in the equation below:

\[
perplexity(D_{test}) = \exp\left\{ - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}
                                          {\sum_{d=1}^{M} N_d} \right\}
\tag{2.6}
\]

where $N_d$ is the number of words in document $d$ and $p(\mathbf{w}_d)$ is the probability of
the words in document $d$. The authors showed that the LDA model outperformed the pLSA model by
assigning more likelihood to the held-out documents that belonged to the same set as the one used
for training the model. The authors concluded that this was due to the fact that the LDA model
uses Dirichlet priors over the topic distribution instead of learning the topic distribution
directly from the documents only. This makes LDA less biased towards the training set and results
in a more generalizable model. In their second experiment, they used LDA for classification to
show its usability for classifying documents. Their experiments were extended in another paper by
Li et al. that studied using LDA for classification [38].
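As an illustration of the perplexity measure in Equation 2.6, the following sketch computes it
from per-document log-likelihoods. It assumes those log-likelihoods have already been obtained
from some trained model; the function name and the toy model (a uniform distribution over a
four-word vocabulary) are ours.

```python
import math


def perplexity(doc_log_probs, doc_lengths):
    """exp(-sum_d log p(w_d) / sum_d N_d) over the held-out documents."""
    return math.exp(-sum(doc_log_probs) / sum(doc_lengths))


# Toy check: a model that assigns probability 1/4 to every word.
lengths = [3, 5]
log_probs = [n * math.log(0.25) for n in lengths]
value = perplexity(log_probs, lengths)
```

For this uniform toy model the perplexity equals the vocabulary size, which matches the usual
reading of perplexity as the effective number of equally likely word choices.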

Li et al. laid out empirical results of text classification using LDA [38]. Their approach was
the same as that of Blei et al.; however, their paper focused on the empirical results of using
LDA for text document classification. The authors used the topic distribution vectors, the gamma
parameters, as the feature vectors for the documents. Gamma represents the topic distribution of
a document; it can also be seen as a lower-dimensional representation of a document in terms of
the topics learned using LDA. They used an SVM as the classifier for these gamma feature vectors
and the Reuters dataset for their experiments. Their results showed that the tf.idf formulation
of the feature vector does better than the LDA gamma vectors. The LDA-based feature set performed
better only if less than 10% of the training set was used.

2.4 Methods for Cross-domain Knowledge Transfer

There are recent papers that try to define and tackle the problem of cross-domain classification.
The definition of cross-domain classification, however, varies from paper to paper. It is
important to clarify these differences in the definition of domains as used by different
researchers. In our case, the distinction between two domains is due to the difference in the
underlying writing style. Examples of different domains in our research are news articles,
e-mails, and Wikipedia articles. Our definition of domains closely resembles the concept often
described by the word genre.

Dai et al. defined the domains as two similar topics from the same source of text [24]. Although
this definition is significantly different from ours, it is still an interesting research paper
because they also used documents from one set to classify documents from another set. The two
domains created by the authors did not differ in terms of the genre of the text, but they
differed in terms of the subject of the text. They used the 20-newsgroups, SRAA [39], and Reuters
datasets. As an example, after splitting the 20-newsgroups dataset into two domains, one set
contained documents on the subject recreation.hockey and another set contained documents on
recreation.baseball. While splitting the Reuters dataset [35] into different domains, they
selected the documents in the orgs class as one domain and the documents in the people class as
the second domain. As can be observed, the different subjects that are split across different
domains share a common, broader underlying topic.

Another similarity of this paper to our research is that they also compared against, and showed
an improvement in classification accuracy over, more conventional machine learning methods such
as naive Bayes and SVMs. Their algorithm used the co-clustering algorithm presented by Dhillon et
al. [40]. The co-clustering algorithm, as described in Dhillon’s paper, works on a word count
matrix, where each row represents a document and each column represents a word. Each entry in the
matrix thus represents the count of a word in that particular document. The co-clustering
algorithm groups these rows and columns simultaneously into a pre-determined number of clusters.
Assuming that the row values are from a random variable $X$ and the column values are from a
random variable $Y$, the co-clustering algorithm seeks to find the following mappings, grouping
the $X$ values into $k$ clusters and the $Y$ values into $l$ clusters:


\[ C_X : \{x_1, x_2, \ldots, x_m\} \rightarrow \{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_k\} \]

\[ C_Y : \{y_1, y_2, \ldots, y_n\} \rightarrow \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_l\} \]

Once we define $\hat{X}$ as $C_X(X)$ and $\hat{Y}$ as $C_Y(Y)$, the goal of the co-clustering
algorithm becomes finding the mapping that minimizes the mutual information loss between
$(X, Y)$ and $(\hat{X}, \hat{Y})$.
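The mutual information loss that the co-clustering minimizes can be illustrated with a small
sketch. The code below assumes a toy 4 x 4 document-word count matrix and two hand-picked cluster
assignments; the helper names and all values are ours and are not taken from [40]. It only
evaluates the loss for fixed mappings and does not perform the search over mappings.

```python
import math


def normalize(counts):
    """Turn a count matrix into a joint distribution p(x, y)."""
    total = float(sum(sum(row) for row in counts))
    return [[c / total for c in row] for row in counts]


def mutual_information(joint):
    """I(X;Y) in nats for a joint distribution given as a list of rows."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:
                mi += p * math.log(p / (px[i] * py[j]))
    return mi


def cluster_joint(joint, row_map, col_map, k, l):
    """Aggregate p(x, y) into p(x_hat, y_hat) under the two cluster maps."""
    clustered = [[0.0] * l for _ in range(k)]
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            clustered[row_map[i]][col_map[j]] += p
    return clustered


counts = [[4, 4, 0, 0],
          [4, 4, 0, 0],
          [0, 0, 4, 4],
          [0, 0, 4, 4]]
joint = normalize(counts)
good = cluster_joint(joint, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)  # follows the block structure
bad = cluster_joint(joint, [0, 1, 0, 1], [0, 1, 0, 1], 2, 2)   # mixes the two blocks
loss_good = mutual_information(joint) - mutual_information(good)
loss_bad = mutual_information(joint) - mutual_information(bad)
```

The mapping that follows the block structure of the matrix loses no mutual information, while the
mapping that mixes the blocks loses all of it, which is why the loss is a sensible objective for
choosing between co-clusterings.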

Dai et al. extended Dhillon’s co-clustering algorithm by adding a term that minimizes the
information loss with respect to the labeled clusters obtained from the documents in the labeled
domain [24]. The following is the cost function used by the authors to perform the co-clustering:

\[
I(D_o; W) - I(\hat{D}_o; \hat{W}) + \lambda \cdot \left( I(C; W) - I(C; \hat{W}) \right)
\tag{2.7}
\]

where $I$ is the mutual information function, $D_o$ is the set of unlabeled domain documents,
$W$ is the word counts, $\hat{D}_o$ and $\hat{W}$ are the co-clusterings of these documents and
words, and $C$ is the set of labeled clusters obtained from the labeled domain documents.
$\lambda$ is simply a weighting parameter that determines the weight of the part of the cost
function that depends on the labeled set of documents, i.e. $I(C; W) - I(C; \hat{W})$. In the
paper, they equated minimization of the above expression with the minimization of the KL
divergence ($D(\cdot \| \cdot)$) between the distributions as follows:

\[
I(D_o; W) - I(\hat{D}_o; \hat{W}) + \lambda \cdot \left( I(C; W) - I(C; \hat{W}) \right)
= D\left( f(D_o, W) \,\|\, \hat{f}(D_o, W) \right)
  + \lambda \cdot D\left( g(C, W) \,\|\, \hat{g}(C, W) \right)
\tag{2.8}
\]

where $f$ and $g$ are the joint probability distributions and $\hat{f}$ and $\hat{g}$ are the
corresponding distributions after the co-clustering. Their method showed a significant
improvement in cross-domain classification accuracy for both the SRAA and 20-newsgroups datasets
but did not show a significant accuracy improvement for the Reuters dataset.

Wang et al. used the same domain definition as the authors of [24]; however, the Reuters dataset
was not used in their paper [19]. Similar to the paper by Dai et al., the 20-newsgroups data was
split into two domains by using some newsgroups as the first domain and some similar newsgroups
as the second domain. The methods in their paper augmented the feature vectors using
Wikipedia-based features [19, 41]. The empirical results showed an increase in accuracy for all
pairs of domains. They did not discuss the number of features that were added to achieve this
increase in accuracy.

Swarup et al. addressed the cross-domain classification problem and motivated it from a point of
view similar to ours [42]. The premise was that since humans are capable of transferring
psychological or neurological concepts from one domain to another, a machine learning algorithm
should be able to do the same. However, this paper lacked formal constructs for its approach and
robust empirical results. The paper was motivated by a conceptual cross-domain learning framework
based on human learning. The structural representations used for cross-domain learning are made
from multiple neural networks and corresponding genomes that weight the neural networks
differently in the presence of different domains. Although this paper did not contain any formal
results or proofs, it presented a new direction for tackling cross-domain learning by developing
structures that can mimic so-called learning-to-learn or accomplish knowledge transfer across
domains.


2.5  Topic  Models  for  Cross-Domain  Classification 

Xue et al. introduced a model titled Topic-bridged pLSA that uses a modified probabilistic latent
semantic analysis (pLSA) algorithm to generate topic models for cross-domain text classification
[43]. The authors applied this extension of the model to bridge knowledge across domains. This is
an interesting extension of pLSA designed specifically for cross-domain classification. Their
experiments used the same datasets as [19] and [24] and also used the same definition of domains.
The authors’ model, however, does not depend on this definition of domains and can therefore be
applied to any two sets of documents, as long as only one of the sets is labeled.


The authors’ model extended the basic pLSA model in two ways. Firstly, they split the likelihood
into two different expressions, one for the likelihood of the labeled set ($d_l$) and one for the
likelihood of the unlabeled set ($d_u$). This way, the user can weight the model differently to
fit labeled or unlabeled data. The relative weight of the two sets of documents is determined by
a weighting parameter, $\lambda$. The combined likelihood was obtained from the following
equation:

\[
L = \sum_{w} \left[ \lambda \sum_{d_l} n(w, d_l) \log \sum_{z} \Pr(d_l \mid z) \Pr(z \mid w)
  + (1 - \lambda) \sum_{d_u} n(w, d_u) \log \sum_{z} \Pr(d_u \mid z) \Pr(z \mid w) \right]
\tag{2.9}
\]


Secondly, they incorporated the label information from the labeled set of documents. This is
accomplished by adding penalties for mismatch between the topics assigned and the document
labels. These penalties cover assigning different topics to documents under the same label and
assigning the same topic to documents under different labels. These penalties are written out in
the following equations:

\[
f_s(d_l^i, d_l^j) = \log \sum_{z} \Pr(d_l^i \mid z) \Pr(d_l^j \mid z)
\tag{2.10}
\]

\[
f_d(d_l^i, d_l^j) = \log \sum_{z_i \neq z_j} \Pr(d_l^i \mid z_i) \Pr(d_l^j \mid z_j)
\tag{2.11}
\]

where $\Pr(d_l^i \mid z) \Pr(d_l^j \mid z)$ represents the probability that two documents
$d_l^i$ and $d_l^j$ are generated by the same topic $z$. The combined likelihood of the model
then becomes:

\[
L_c = L + \beta_1 \sum f_s(d_l^i, d_l^j) + \beta_2 \sum f_d(d_l^i, d_l^j)
\tag{2.12}
\]

where $\beta_1$ and $\beta_2$ are again the weighting parameters for the two types of penalties.
The experimental results did show an improvement over algorithms such as SVM and NB. The authors
compared the classification accuracy obtained from classifying documents where the training and
test sets belonged to two different domains. Such a classification is a particularly challenging
task for SVM and NB. Therefore, an improvement in classification accuracy over SVM and NB
classifiers may not be very difficult to achieve.

Zhai et al. approached the cross-domain learning problem from yet another point of view [44]. The
authors tackled the problem of finding common themes among different domains. Marx et al. and
Sarawagi et al. have also discussed a similar problem in their papers [45] and [46],
respectively. The problem of finding common themes is interesting in its own right, but it is
also useful to us since the results of their paper can be extended to other problems such as the
cross-domain classification problem, as we define it in this dissertation.

Here we go over the formulation presented in [44] in more detail; even though their goals are
different, the formulation resembles our approach for cross-domain classification to a certain
degree. The authors modified a unigram mixture model into a new generative model that helps in
finding the common themes across different collections. The unigram mixture model, defined in
Equation 2.13, assumes that all the words in the set of documents belong to a mixture of topics.
To generate a document, $\mathbf{w}$, under a unigram mixture model, one first picks a topic,
$z$, from the mixture of topics and then iteratively chooses each word from the distribution
associated with that topic.

\[
p_d(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)
\tag{2.13}
\]



In their paper, Zhai et al. first extended the model by adding a common background theme to
their collection. This is represented by $\theta_B$ in the following equation:

\[
p_d(w) = \lambda_B \cdot p(w \mid \theta_B)
       + (1 - \lambda_B) \sum_{j=1}^{k} \left[ \pi_{d,j} \cdot p(w \mid \theta_j) \right]
\tag{2.14}
\]

where $\theta_B$ is assumed to be the theme that represents noise or general words. The authors
used the word theme to mean what would be known as a topic in the topic model vocabulary. In
their model, all the other words are picked from $k$ different themes. This formulation can be
seen as an extension of the unigram mixture model that encourages all the general words to be put
into a common background theme, represented by $\theta_B$. $\lambda_B$ represents the weight of
the background theme versus the specific themes.
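The word probability in Equation 2.14 can be evaluated with a few lines of code. The sketch below
assumes each theme is stored as a dictionary from words to probabilities; the function name
word_probability, the two-word vocabulary, and all numbers are hypothetical and only illustrate
the arithmetic.

```python
def word_probability(word, lambda_b, background, doc_weights, themes):
    """lambda_B * p(w | theta_B) + (1 - lambda_B) * sum_j pi_dj * p(w | theta_j)."""
    specific = sum(pi * theme.get(word, 0.0)
                   for pi, theme in zip(doc_weights, themes))
    return lambda_b * background.get(word, 0.0) + (1.0 - lambda_b) * specific


background = {"the": 0.9, "war": 0.1}            # theta_B: general words dominate
themes = [{"the": 0.1, "war": 0.9},              # theta_1
          {"the": 0.2, "war": 0.8}]              # theta_2
pi_d = [0.5, 0.5]                                # document-specific theme weights
p_war = word_probability("war", 0.5, background, pi_d, themes)
p_the = word_probability("the", 0.5, background, pi_d, themes)
```

Because the background theme absorbs most of the probability of a general word such as "the", the
specific themes are free to concentrate on content words.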

The authors further extended this model and split the themes to better fit the multi-domain
framework. Each theme was split into a theme specific to a given domain and a theme common to all
domains. With this extension they obtained the following formulation for unigram mixture models:

\[
p_d(w \mid C_i) = (1 - \lambda_B) \sum_{j=1}^{k} \left[ \pi_{d,j} \left( \lambda_C \cdot
  p(w \mid \theta_j) + (1 - \lambda_C) \cdot p(w \mid \theta_{j,i}) \right) \right]
  + \lambda_B \cdot p(w \mid \theta_B)
\tag{2.15}
\]


In  their  experiments,  the  authors  clustered  news  stories  that  contained  information  from  the 
Iraq  war  and  the  Afghanistan  war.  As  expected  from  their  model,  they  obtained  themes  that 
were  split  into  different  groups  based  on  their  prevalence  across  domains  i.e.  some  themes  were 
common  to  all  domains  while  others  were  specific  to  a  single  domain.  In  particular,  they  showed 
that  some  of  their  themes  contained  words  that  are  common  to  all  news  stories,  some  contained 
words  that  were  common  to  only  war  stories,  and  some  themes  were  specific  to  only  Iraq  war 
or  Afghanistan  war.  The  authors  were  successful  in  extracting  common  themes  across  different 
domains,  which  is  a  goal  that  is  closely  related  to  our  own  research. 

2.6  Conclusion 

This chapter discussed the related work, which included (1) classification of text documents
using Wikipedia, (2) topic models for supervised learning, (3) using unsupervised topic models
for classification, (4) methods for cross-domain knowledge transfer, and (5) topic models for
cross-domain classification. The first section went over the ways of using Wikipedia as a tool to
aid in text document classification. The second and third sections discussed the role of topic
models in classification from two points of view. Firstly (in Section (2)), we discussed topic
models that incorporate the labeled data to learn a generative model; these topic models thus
work in a supervised learning framework. Secondly (in Section (3)), we discussed the ways of
using unsupervised topic models for classification purposes. Sections (4) and (5) concentrated on
the problem of cross-domain document classification, where Section (4) discussed some of the
general approaches to solving the problem and Section (5) presented ways of using topic models to
solve it. As we can see, a wide variety of work has been done in the field of cross-domain
classification; most of it, however, tackles a slightly different version of the problem than
ours. The previous work related to cross-domain classification either defines the word domain
differently or does not apply to document classification. We introduce new methods and datasets
to classify documents in a domain without requiring any labeled document from that particular
domain, using only labeled documents from another domain. Our work, in this sense, does not
extend any one or two specific earlier works but extends the field of cross-domain classification
as a whole.

CHAPTER 3:
Background


This chapter introduces some of the general machine learning concepts that are used in this
dissertation. Some of these concepts, such as support vector machines (SVMs) and the Latent
Dirichlet Allocation (LDA) topic model, may already be familiar to readers proficient in the
field of machine learning, while other concepts discussed in this chapter, such as Bregman
divergences and their relationship to the exponential family of distributions, are a little more
obscure. In this chapter, we will briefly outline some common classification algorithms, along
with a discussion of the EM algorithm and some common topic models. We will then introduce the
concept of the exponential family of distributions and its relationship with Bregman divergences.
We will also take a closer look at the Dirichlet distribution (a member of the exponential
family) and the KL divergence (a member of the Bregman divergences).

3.1  Classification  Algorithms 

Classification algorithms use supervised learning techniques to classify data into pre-determined
categories. Each data point used by these algorithms is presented in the form of a
multi-dimensional numerical vector, called a feature vector. For example, a text document can be
made into a $d$-dimensional feature vector where the $i$th entry represents the count of the
$i$th word in the vocabulary. This can be seen in Figure 3.1. The classification algorithms use
labeled data as a training set to learn the classification model and then apply this learned
model to classify unlabeled data from the testing set. In this manner, a classification algorithm
can be seen as a mathematical function that maps data feature vectors to a set of class labels,
as written below:

\[ X \in \mathbb{R}^d \]

\[ Y \in \{1, 2, \ldots, k\} \]

\[ f : X \rightarrow Y \]

where $X$ is a $d$-dimensional real-valued feature vector, $Y$ is a classification label drawn
from the integers 1 to $k$, and $f$ is the classifier function that maps $X$ to $Y$. In this
section, we will go over some of the common classification algorithms and their corresponding
classifier functions, $f$.

Figure 3.1: A text document can be made into a d-dimensional feature vector where the ith entry
represents the count of the ith word in the vocabulary.
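The word-count feature vector described above can be built as in the following minimal sketch.
The three-word vocabulary and the function name count_vector are ours, chosen for illustration;
words outside the vocabulary are simply ignored.

```python
def count_vector(tokens, vocabulary):
    """Return the d-dimensional vector whose ith entry counts the ith vocabulary word."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vec = [0] * len(vocabulary)
    for tok in tokens:
        if tok in index:
            vec[index[tok]] += 1
    return vec


vocab = ["ball", "game", "vote"]
x = count_vector("the game the ball the game".split(), vocab)
```

Here the resulting vector has one entry per vocabulary word, so every document maps to a point in
the same $d$-dimensional space regardless of its length.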


3.1.1  Naive  Bayes 

Naive Bayes is one of the simplest and most widely used classifiers, especially for text
documents. It computes the probability of a document belonging to a specific class given the
counts of the words in the document. Naive Bayes is called naive because it “naively” assumes
independence among the word counts in a document given the class. It treats the document as a
“bag of words,” without incorporating any contextual or semantic information in the feature
vector it uses. The impact of some of the assumptions commonly made in a typical naive Bayes
classifier has been a topic of debate, and many improvements have been proposed in the literature
[47, 8]. However, even with these assumptions, the classifier generally performs very well in
practice [48]. The naive assumption that all features are independent simplifies the posterior
probability equations. Under this assumption, one can compute the posterior probability by using
only the class prior probabilities and the probability of each feature independently. The details
of the formulations used in this classifier are described in the following paragraph.

Let’s assume that there are D documents in our training set. Each document, d, in the
collection can be represented by a feature vector, w, containing the count of each word in
the vocabulary. The dimension or length, v, of this feature vector is the size of the entire
vocabulary, and the entry w[i] represents the number of times the ith word appears in the document.
With this information, we can compute the class-conditional probability of each feature (or word)
given a class. This probability can be estimated by counting the number of times that word
occurs in the documents of that class in the training set and normalizing by the total word count
of the class. Given the independence assumption among the words, the probability of a document D
given the class C is:

$$P(D \mid C) = \prod_{i} p(w_i \mid C) \tag{3.1}$$


where $p(w_i \mid C)$ is the probability of the ith word given the class C. Using Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \tag{3.2}$$

we can compute the probability of class C given the document D, $P(C \mid D)$, as follows:

$$P(C \mid D) = \frac{P(D \mid C)\,P(C)}{P(D)} \tag{3.3}$$


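Using the above formulation, given a feature vector $x \in \mathbb{R}^v$ containing word counts and
a set of j different categories $Y = \{1, \ldots, j\}$, the classifier assigns to the document x the
category that maximizes the posterior probability, as shown below:

$$f(x) = \arg\max_{j} P(Y_j \mid x) = \arg\max_{j} P(x \mid Y_j)\,P(Y_j) \tag{3.4}$$

The second equality holds because the denominator P(x) from Bayes' theorem is the same for every
category. Here f is the classifier function that maps $x \in \mathbb{R}^v$ onto Y.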
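The decision rule in Equation 3.4 can be illustrated with a minimal Python sketch. It is not the implementation used in this dissertation: the toy vocabulary, the function names, and the add-one (Laplace) smoothing are assumptions made for the example, and each word occurrence is treated as one factor of the product in Equation 3.1. Logarithms are used so that the product of many small probabilities does not underflow.

```python
import math
from collections import defaultdict

def train_naive_bayes(docs, labels, vocab_size, smoothing=1.0):
    """Estimate log P(C) and log p(w_i | C) from word-count vectors."""
    class_docs = defaultdict(int)
    class_word_counts = defaultdict(lambda: [0] * vocab_size)
    for counts, label in zip(docs, labels):
        class_docs[label] += 1
        for i, c in enumerate(counts):
            class_word_counts[label][i] += c
    log_prior, log_likelihood = {}, {}
    for label, n_docs in class_docs.items():
        log_prior[label] = math.log(n_docs / len(docs))
        # Add-one smoothing (an assumption of this sketch) avoids log(0).
        total = sum(class_word_counts[label]) + smoothing * vocab_size
        log_likelihood[label] = [
            math.log((c + smoothing) / total) for c in class_word_counts[label]
        ]
    return log_prior, log_likelihood

def classify(counts, log_prior, log_likelihood):
    """Return argmax_C [log P(C) + sum_i w[i] * log p(w_i | C)]."""
    def score(label):
        return log_prior[label] + sum(
            c * lp for c, lp in zip(counts, log_likelihood[label])
        )
    return max(log_prior, key=score)

# Hypothetical toy vocabulary: ["goal", "match", "vote", "election"]
train_docs = [[3, 2, 0, 0], [2, 1, 0, 1], [0, 0, 3, 2], [0, 1, 2, 3]]
train_labels = ["sports", "sports", "politics", "politics"]
log_prior, log_likelihood = train_naive_bayes(train_docs, train_labels, vocab_size=4)
prediction = classify([1, 1, 0, 0], log_prior, log_likelihood)
```

The evidence term P(D) is never computed, since it does not change which class attains the maximum.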

3.1.2  K-Nearest  Neighbors 

The k-nearest neighbor algorithm is also a supervised learning algorithm. Like naive Bayes, it is
one of the simpler and more widely used machine learning algorithms. The objective is to classify
a given d-dimensional point x into a class $j \in Y$. The algorithm is explained in detail in the
next paragraph.

Let’s assume that the algorithm is given a training set of n documents, represented as
d-dimensional feature vectors, that belong to c different classes. To classify an unlabeled
document, represented as a vector x, one first computes the distance from x to each point in the
training set and then selects the k nearest neighbors of x. The label assigned to x is the
majority label among these k closest neighbors. If k = 1, x is assigned the class of its nearest
neighbor in the training set. To compute the distances from the test point x to the points in the
training set, any distance measure can be used, such as Euclidean distance, cosine similarity, or
KL-distance. The choice of the parameter k can depend on the size of the training set and the
number of classes, and it can be adjusted based on cross-validation results. In this dissertation,
unless otherwise specified, we choose k to be 31.

The classifier function, f, for the k-nearest neighbor algorithm with k = 1 can be obtained as
follows. Given a feature vector $x \in \mathbb{R}^d$ containing word counts, a set of j different
categories $Y = \{1, \ldots, j\}$, and a distance measure s, the classifier assigns to the document x
the category of its closest neighbor, as shown below:

$$b = \arg\min_{m_i} s(m_i, x)$$

$$f(x) = \mathrm{label}(b)$$

where the $m_i$ are the documents in the training set and b is the nearest neighbor of x among
them. The function f is the classifier function that maps $x \in \mathbb{R}^d$ onto Y.
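The procedure can be sketched in a few lines of Python. This is an illustrative example only: the toy data and function names are assumptions, Euclidean distance stands in for the distance measure s, and ties in the vote are broken arbitrarily.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(x, train_points, train_labels, k=1, distance=euclidean):
    """Label x by majority vote among its k nearest training points."""
    ranked = sorted(
        zip(train_points, train_labels), key=lambda pair: distance(pair[0], x)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical word-count vectors over a four-word vocabulary.
train_points = [
    [3, 2, 0, 0], [2, 1, 0, 1], [4, 3, 1, 0],
    [0, 0, 3, 2], [0, 1, 2, 3], [1, 0, 4, 2],
]
train_labels = ["sports", "sports", "sports", "politics", "politics", "politics"]
query = [2, 2, 0, 0]
label_k1 = knn_classify(query, train_points, train_labels, k=1)
label_k3 = knn_classify(query, train_points, train_labels, k=3)
```

Passing a different function as `distance` swaps in another measure without changing the voting logic.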

3.1.3  Support  Vector  Machines 

Support Vector Machines (SVMs) were developed by Vapnik et al. [49] in 1992. SVMs are linear
classifiers similar to a single-layer perceptron; however, an SVM finds the separating
hyperplane that has the maximum margin between the two classes of data points. Since their
introduction, SVMs have proven to be very effective in a wide array of classification tasks,
especially in text classification [50]. A brief formulation of the cost function that SVMs
minimize is given in the following paragraph.

Given a training set of instance-label pairs $(x_i, y_i)$, $i = 1, \ldots, l$, where
$x_i \in \mathbb{R}^d$ and $y \in \{1, -1\}^l$, the support vector machine (SVM) [51, 6] finds the
hyperplane, with normal vector w, by solving the following optimization problem:

$$\min_{w, b, \xi} \;\; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$$

$$\text{subject to} \;\; y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0. \tag{3.5}$$

Here $\phi(\cdot)$ denotes the feature mapping applied to the inputs (the identity map in the linear
case). This formulation results in a margin of $1 / \lVert w \rVert$ for the resulting hyperplane.
The first constraint requires all training points to be on the “right” side of the hyperplane,
except for a few points that can be on the “wrong” side of the hyperplane within some slack value
denoted by $\xi_i$. If the ith training example lies on the “wrong” side of the hyperplane, the
corresponding $\xi_i > 1$. Because of the term $C \sum_{i=1}^{l} \xi_i$ in the cost function that is
to be minimized, C acts as a parameter that weighs the training errors in the final solution. In
other words, C allows trading off training error against model complexity. The value of this
parameter is left to the user and is often chosen based on the results of cross-validation or some
other model selection strategy. In our experiments, unless otherwise stated, C is set to 1.

Once the normal vector w of the separating hyperplane and the threshold b are determined, we can
use the following classifier function, f:

$$f(x) = \mathrm{sgn}(w \cdot x + b) \tag{3.6}$$

f is used to map $x \in \mathbb{R}^d$ onto Y.
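To make the role of the slack variables concrete, the following Python sketch evaluates the cost function of Equation 3.5 and the classifier of Equation 3.6 for a fixed, hand-picked hyperplane. It does not solve the optimization problem; the data, the candidate (w, b), and the function names are assumptions for illustration, and the feature map is taken to be the identity.

```python
def decision(w, b, x):
    """Classifier function f(x) = sgn(w . x + b) from Equation 3.6."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

def slacks(w, b, points, labels):
    """Smallest xi_i with y_i (w . x_i + b) >= 1 - xi_i and xi_i >= 0."""
    return [
        max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, y in zip(points, labels)
    ]

def primal_objective(w, b, points, labels, C=1.0):
    """Cost 0.5 * w^T w + C * sum_i xi_i for a candidate hyperplane."""
    return 0.5 * sum(wi * wi for wi in w) + C * sum(slacks(w, b, points, labels))

# Hypothetical two-dimensional data and a hand-picked candidate hyperplane.
points = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.5, 1.0]]
labels = [1, 1, -1, -1]
w, b = [1.0, 0.0], 0.0
xi = slacks(w, b, points, labels)
cost = primal_objective(w, b, points, labels, C=1.0)
```

In this example the second point is classified correctly but lies inside the margin (slack between 0 and 1), while the fourth point is on the wrong side of the hyperplane and has a slack greater than 1. A larger C makes such violations more expensive relative to the margin term.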

3.2  EM  Algorithm  and  Topic  Models 

This section introduces the Expectation-Maximization (EM) algorithm and two common topic models.
EM is commonly used for soft clustering, where the parameters of the underlying distribution are
determined such that the likelihood of the observed data is maximized. The underlying
distribution is often assumed to be a mixture of multiple distributions, e.g., a mixture of
Gaussians or a mixture of multinomials. Topic models are mixture models in which documents are
assumed to be generated from a mixture of distributions. Most topic models use the EM algorithm
to find the parameters of the unobservable underlying distributions. Various topic models have
been proposed in the literature, such as probabilistic Latent Semantic Analysis (pLSA) and Latent
Dirichlet Allocation (LDA). These topic models differ in their underlying assumptions about the
type of distributions, the role of prior probabilities, etc. In topic models, a topic is defined
as a distribution over the words in the vocabulary. The following subsections introduce these
three concepts in more detail.

3.2.1  Expectation-Maximization  Algorithm 

The Expectation-Maximization (EM) algorithm is often the algorithm of choice for finding the
parameters of a distribution that maximize the likelihood of the observed data. Let X be a set of
multidimensional vectors representing the observed data, and let $\theta$ be the parameters of the
distribution of X. The goal of the EM algorithm is to find the $\theta$ that maximizes the
likelihood of X. For the sake of convenience, we choose to maximize the log-likelihood function,
which is defined as follows:

$$L(\theta) = \ln P(X \mid \theta) \tag{3.7}$$

In some simple cases, the parameter estimate that maximizes this likelihood has a closed-form
solution. For example, for observed data assumed to be distributed according to a single
normal distribution with two parameters, $\mu$ and $\sigma^2$, the optimal value of $\mu$ is simply
the average of X, and the optimal value of $\sigma^2$ is the variance of X. However, the assumed
underlying distribution is often not as simple as a single Gaussian but consists of a mixture of
distributions. To find the parameters of such a mixture of distributions, we introduce hidden
variables that make the maximum likelihood estimation of $\theta$ tractable. We represent these
hidden variables by $z \in Z$.

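The joint distribution of X and z is

$$P(X, z \mid \theta) = \prod_{i} P(x_i \mid z_i, \theta)\,P(z_i \mid \theta) \tag{3.8}$$

The likelihood expression becomes:

$$P(X \mid \theta) = \prod_{i} \sum_{z} P(x_i \mid z_i, \theta)\,P(z_i \mid \theta) \tag{3.9}$$

Since we are interested in maximizing the log-likelihood, we take the log of Equation 3.9 as
follows:

$$L(\theta) = \log P(X \mid \theta) = \sum_{i} \log \sum_{z} P(x_i \mid z_i, \theta)\,P(z_i \mid \theta) \tag{3.10}$$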


This is not an easy expression to maximize (given the log of sums); therefore, we introduce a
distribution over the $z_i$'s, $q(z_i)$, as follows:

$$L(\theta) = \sum_{i} \log \sum_{z} q(z_i)\,\frac{P(x_i \mid z_i, \theta)\,P(z_i \mid \theta)}{q(z_i)} \tag{3.11}$$

$$L(\theta) \ge \sum_{i} \sum_{z} q(z_i) \log \frac{P(x_i, z_i \mid \theta)}{q(z_i)} \tag{3.12}$$

The final inequality is reached by using the following identity along with Jensen’s inequality:

$$\sum_{z} q(z_i)\,\frac{P(x_i, z_i \mid \theta)}{q(z_i)} = E_{q}\!\left[ \frac{P(x_i, z_i \mid \theta)}{q(z_i)} \right] \tag{3.13}$$

Jensen’s inequality states that, for a concave function f,

$$f(E(x)) \ge E(f(x)) \tag{3.14}$$

Since log is a concave function, we obtain the lower bound on $L(\theta)$ given in Equation 3.12.

This lower bound can be made equal to $L(\theta)$ at a fixed $\theta_n$ by setting

$$q(z_i) = P(z_i \mid x_i, \theta_n) \tag{3.15}$$

Since the bound in Equation 3.12 lies below $L(\theta)$ and is equal to $L(\theta)$ at $\theta_n$, we
maximize this bound with respect to $\theta$:

$$\theta_{n+1} = \arg\max_{\theta} \sum_{i} \sum_{z} q(z_i) \log \frac{P(x_i, z_i \mid \theta)}{q(z_i)} \tag{3.16}$$

Equations 3.15 and 3.16 are called the E-step and the M-step of the EM algorithm, respectively.
The EM algorithm maximizes the likelihood, $L(\theta)$ (in general reaching a local maximum), by
alternating between the E-step and the M-step, thus iteratively obtaining the next parameter
estimate $\theta_{n+1}$ from $\theta_n$.

The EM algorithm forms the basis of many algorithms that require optimizing the parameters of a
distribution with hidden variables. The next two sections introduce two topic models that also
use the EM algorithm to find the parameters of an underlying distribution described by a Bayesian
network with hidden variables.
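As an illustration of the E-step and M-step, the following Python sketch runs EM on a mixture of two one-dimensional Gaussians, the simplest mixture mentioned above. The data, the starting values, the variance floor, and the function names are assumptions chosen for the example; the M-step uses the standard closed-form updates for a Gaussian mixture.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    """L(theta) = sum_i log sum_z P(x_i | z, theta) P(z | theta)."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def em_step(data, weights, means, variances):
    # E-step: q(z_i) = P(z_i | x_i, theta_n), the posterior over components.
    resp = []
    for x in data:
        joint = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: closed-form maximizer of the lower bound for a Gaussian mixture.
    new_w, new_m, new_v = [], [], []
    for c in range(len(weights)):
        n_c = sum(r[c] for r in resp)
        mean_c = sum(r[c] * x for r, x in zip(resp, data)) / n_c
        var_c = sum(r[c] * (x - mean_c) ** 2 for r, x in zip(resp, data)) / n_c
        new_w.append(n_c / len(data))
        new_m.append(mean_c)
        new_v.append(max(var_c, 1e-6))  # variance floor, an assumption of this sketch
    return new_w, new_m, new_v

# Hypothetical data drawn around -2 and +2.
data = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.0, 2.1, 2.2, 1.8]
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
history = [log_likelihood(data, weights, means, variances)]
for _ in range(20):
    weights, means, variances = em_step(data, weights, means, variances)
    history.append(log_likelihood(data, weights, means, variances))
```

The list `history` records the log-likelihood after each iteration; it should never decrease, which is the property that the lower-bound argument above guarantees.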


3.2.2  Probabilistic  Latent  Semantic  Analysis 

Probabilistic latent semantic analysis (pLSA) was proposed in 1999 by Hofmann et al. [52].
pLSA is a mixture model that assigns multiple topics to a single document; that is, each document
is assumed to be generated from multiple topics. To generate each word in a document under
pLSA, one first picks a topic from the set of topics and then generates the word according to the
multinomial distribution described by that topic. The probability of a topic depends on the
document itself and is written as $p(z \mid d)$, where z is the topic and d is the document.
This formulation results in the following equation for the joint distribution of the document and
the words:

$$p(d, w_n) = p(d) \sum_{z \in Z} p(w_n \mid z)\,p(z \mid d) \tag{3.17}$$

Figure 3.2: pLSA represented as a Bayes network using the plate notation


where $z \in Z$ is a particular topic, i.e., a multinomial distribution over the words. In pLSA,
each document has an associated topic distribution over Z. The probability of the nth word,
$w_n$, is computed by adding up the probabilities of that word under the different topics, weighted
by the probability of each topic in the document. Thus, to generate a new document under pLSA, we
need a distribution over the topics, and these distributions are tied to the documents in the
training set. Therefore, pLSA will generate a new document that is similar to one of the given
documents in the training set. In this way, pLSA can be seen as a pseudo-generative model, i.e.,
it can only generate a new document based on a document from the training set. Figure 3.2 shows a
plate notation diagram for the pLSA model.
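The mixture in Equation 3.17 can be made concrete with a small Python sketch. The two topics, the document's topic distribution, and the value of p(d) are hypothetical numbers chosen for illustration, not parameters learned by pLSA.

```python
def word_distribution(topic_word, doc_topic):
    """p(w | d) = sum_z p(w | z) p(z | d) for every word w in the vocabulary."""
    vocab_size = len(topic_word[0])
    return [
        sum(p_z * topic_word[z][w] for z, p_z in enumerate(doc_topic))
        for w in range(vocab_size)
    ]

# Hypothetical topics over the vocabulary ["goal", "match", "vote", "election"].
topic_word = [
    [0.5, 0.4, 0.05, 0.05],  # p(w | z = 0), a "sports"-like topic
    [0.05, 0.05, 0.5, 0.4],  # p(w | z = 1), a "politics"-like topic
]
doc_topic = [0.8, 0.2]       # p(z | d) for one training document
p_doc = 0.1                  # p(d)

p_w_given_d = word_distribution(topic_word, doc_topic)
joint = [p_doc * p for p in p_w_given_d]  # p(d, w) as in Equation 3.17
```

Because `doc_topic` is indexed by a specific training document, the sketch also shows why pLSA has no direct way to assign topic proportions to a document it has not seen.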


3.2.3  Latent  Dirichlet  Allocation 

Latent Dirichlet Allocation (LDA) is similar to pLSA, discussed earlier. It is also a generative
model for words in documents. In LDA, the model learns k topics, where k can be any positive
integer. Since a topic is defined as a distribution over the words, the k topics learned by LDA
are k different multinomial distributions over the words in the vocabulary. The vocabulary is the
set of all the words used in the training set of documents. LDA tends to learn clusters of
words that form semantic themes (words describing the same concept tend to be assigned to the
same topic). A common measure for comparing topic models is the perplexity of a held-out document
set, which is derived from the likelihood of that set; a higher held-out likelihood corresponds
to a lower perplexity. Blei et al. show that a model learned using LDA achieves a lower perplexity
than one learned using pLSA [11]. They hypothesize that this improvement may be due to the fact
that LDA has a Bayesian prior over the word distribution over topics. The lack of this prior in
pLSA makes it overfit the training set, thus increasing the perplexity of a new document set under
the model given by pLSA.
Given a document as a collection of words, LDA distributes them over k different topics and
represents the document as a mixture of these topics. The LDA model can be described by two
parameters, namely $\alpha$ and $\beta$. $\alpha$ is the parameter of the Dirichlet distribution that
is used to generate the topic distributions for individual documents, and $\beta$ is a set of
multinomial distributions over the words for individual topics. Given the model parameters
$\{\beta, \alpha\}$, to generate a document with N words $W = \{w_1, w_2, \ldots, w_N\}$ under LDA, we
follow this generative process:

• sample the topic proportions $\theta \sim \mathrm{Dirichlet}(\alpha)$,

• for each word $w_n \in \{w_1, \ldots, w_N\}$,

- sample a topic, $z_n$, from $\theta$,

- sample the word, $w_n$, from the multinomial distribution in $\beta$ for the topic $z_n$.
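The generative process listed above can be sketched in Python as follows. The vocabulary and the values of alpha and beta are hypothetical, and the Dirichlet draw is implemented by normalizing independent Gamma draws, which is one standard way to sample from a Dirichlet distribution.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, beta, vocab, n_words, rng):
    """Follow the LDA generative process for one document of n_words words."""
    theta = sample_dirichlet(alpha, rng)  # topic proportions for this document
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]  # topic z_n from theta
        w = rng.choices(vocab, weights=beta[z])[0]            # word w_n from beta[z_n]
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Hypothetical model parameters for k = 2 topics and a four-word vocabulary.
vocab = ["goal", "match", "vote", "election"]
alpha = [0.5, 0.5]
beta = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 only emits "goal" and "match"
    [0.0, 0.0, 0.5, 0.5],  # topic 1 only emits "vote" and "election"
]
rng = random.Random(0)
theta, topics, words = generate_document(alpha, beta, vocab, n_words=8, rng=rng)
```

Inference in LDA runs in the opposite direction: given only the words, it estimates the hidden topic proportions and topic assignments.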


Equation 3.18 gives the expression for the probability of w, where w is a word vector.
Figure 3.3 shows the LDA model in the plate notation of Bayes networks.

$$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\,p(w_n \mid z_n, \beta) \right) d\theta \tag{3.18}$$


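Equation 3.19 provides a more detailed version of Equation 3.18 by expanding the probability
of $\theta$, $p(\theta \mid \alpha)$, using the Dirichlet probability density function.

$$p(w \mid \alpha, \beta) = \frac{\Gamma\left(\sum_{i} \alpha_i\right)}{\prod_{i} \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta \tag{3.19}$$

Here V is the size of the vocabulary, and $w_n^j$ equals 1 if the nth word is the jth word of the
vocabulary and 0 otherwise.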


Even though LDA is an unsupervised model, it can be used for classification because it provides
a different view of the documents as distributions over topics. One can use these k topics to
compare different collections of documents. Since a document in LDA is represented as a
distribution over these topics, one can generate such a representation for a set of documents in
which some documents are labeled and some are not. The labeled documents can then be used to
assign labels to the unlabeled ones by training a classifier, such as an SVM, on the topic
distribution representations of the labeled documents and applying it to the unlabeled ones.

Figure 3.3: LDA represented as a Bayes network using the plate notation

In the next section, we introduce the concept of the exponential family of distributions, which
is essential in formulating some of the distance measures that we use in our classification
algorithms.


3.3  Exponential  Family  of  Distributions 

In this section, we review the exponential family of distributions. Exponential families make up
an important class of probability distributions. In addition to having convenient algebraic
properties, they arise as the natural distributions to consider in many settings. These families
include both discrete and continuous distributions.

Definition: $P_{(\psi,\theta)}(x)$ is an exponential family distribution if it can be written in
the following form. Let $\Gamma$ be a convex set in $\mathbb{R}^d$, and let $\theta \in \Gamma$ be the
natural parameter. Then

$$P_{(\psi,\theta)}(x) = \exp\left( \phi(x) \cdot \theta - \psi(\theta) - \lambda(x) \right) \tag{3.20}$$

where $\psi(\theta)$ is a normalization term that is differentiable on $\mathrm{int}(\Gamma)$ and is
called the log-partition function, and $\phi(x)$ is called the sufficient statistic [53].

As an example, the Poisson distribution belongs to the exponential family. It is normally written
as

$$P(x \mid \mu) = \frac{\mu^x e^{-\mu}}{x!}$$

where $\mu$ is the parameter of the Poisson distribution.
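Table 3.1: Some common exponential family distributions with their natural parameters and
log-partition functions.

| $P_{(\psi,\theta)}(x)$ | $\phi(x)$ | $\theta$ | $\psi(\theta)$ | Distribution |
|---|---|---|---|---|
| $p^x (1-p)^{(1-x)}$ | $(x,\ 1-x)$ | $(\log p,\ \log(1-p))$ | $\log(e^{\theta_1} + e^{\theta_2})$ | Binomial |
| $\frac{\lambda^x e^{-\lambda}}{x!}$ | $x$ | $\log \lambda$ | $e^{\theta}$ | Poisson |
| $\frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)}$ | $(x,\ x^2)$ | $\left(\frac{\mu}{\sigma^2},\ -\frac{1}{2\sigma^2}\right)$ | $\frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)$ | Gaussian |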

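The Poisson distribution can be reduced to the exponential family form as follows:

$$\begin{aligned}
P(x \mid \mu) &= \frac{\mu^x e^{-\mu}}{x!} \\
&= \frac{e^{x \log \mu}\, e^{-\mu}}{x!} \\
&= e^{-\log(x!)}\, e^{x \log \mu}\, e^{-\mu} \\
&= e^{x \log \mu - \mu - \log(x!)}
\end{aligned}$$

Following the exponential family form, with sufficient statistic $\phi(x) = x$, we get

$$\theta = \log(\mu), \qquad \psi(\theta) = e^{\theta}, \qquad \lambda(x) = \log(x!) \tag{3.21}$$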
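The reduction can be checked numerically. The short Python sketch below evaluates the Poisson probability both in its usual form and in the exponential family form of Equation 3.20 with the quantities from Equation 3.21; the function names and the choice of the parameter value 3.5 are assumptions for illustration.

```python
import math

def poisson_pmf(x, mu):
    """Usual form: mu^x e^(-mu) / x!."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

def poisson_expfam(x, mu):
    """Exponential family form exp(phi(x) * theta - psi(theta) - lambda(x))."""
    theta = math.log(mu)               # natural parameter
    psi = math.exp(theta)              # log-partition function psi(theta) = e^theta
    lam = math.log(math.factorial(x))  # lambda(x) = log(x!)
    return math.exp(x * theta - psi - lam)  # sufficient statistic phi(x) = x

pairs = [(poisson_pmf(x, 3.5), poisson_expfam(x, 3.5)) for x in range(10)]
```

Each pair in `pairs` should agree up to floating-point rounding.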

Table 3.1 lists some common exponential family distributions and their characteristic parameters;
the Gaussian, Binomial, Beta, Multinomial, and Dirichlet distributions, among others, are all
members of the family. The exponential family of distributions has a number of convenient
properties. Here we list four important properties and discuss them in detail as they relate to
our research.


• The first important property of the exponential family of distributions is the existence of
conjugate priors. The conjugate prior of a distribution is another distribution over its
parameters. In the case of the exponential family, the conjugate prior is also a member of the
exponential family. Furthermore, given any member of the exponential family written as in
Equation 3.20, the conjugate prior distribution can be expressed in the following form:

$$p(\theta \mid \alpha, \beta) = m(\alpha, \beta) \exp\left( \langle \theta, \alpha \rangle - \beta\,\psi(\theta) \right) \tag{3.22}$$

where $\alpha$ and $\beta$ are hyperparameters of the conjugate prior. Importantly, the function
$\psi(\cdot)$ is the same for the exponential family member and its conjugate prior.

• The second important property of the exponential family of distributions is the bijection
between exponential family members and Bregman divergences. This property means that each
exponential family member, described by a convex log-partition function $\psi(\theta)$, has a
corresponding Bregman divergence, described by a convex function $\phi$. Furthermore, the two
convex functions $\psi$ and $\phi$ are Legendre conjugates of each other.

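• The third important property of the exponential family of distributions is the one-to-one
mapping between the canonical parameters, $\theta$, and the so-called mean parameters, which we
denote by $\mu$. For each canonical parameter $\theta \in \Theta$, there exists a mean parameter
$\mu \in \mathcal{M}$, where $\mathcal{M}$ can be defined as:

$$\mathcal{M} := \left\{ \mu \in \mathbb{R}^d : \mu = \int \phi(x)\,p(x; \theta)\,dx \ \text{ for some } \theta \in \Theta \right\} \tag{3.23}$$

Furthermore, $\Theta$ and $\mathcal{M}$ are dual spaces in the sense of Legendre duality: two spaces
$\Theta$ and $\mathcal{M}$ are duals of each other if, for each $\theta \in \Theta$,
$\nabla\psi(\theta) = \mu \in \mathcal{M}$. This duality yields the following relationship between
the log-partition function $\psi(\theta)$ and the sufficient statistics function $\phi(x)$:

$$\nabla \psi(\theta) = E(\phi(x)) = \mu \tag{3.24}$$

$$\nabla \phi(\mu) = \theta \tag{3.25}$$

In Equation 3.25, $\phi$ denotes the Legendre conjugate of $\psi$ from the second property rather
than the sufficient statistic.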


• The fourth important property of the exponential family of distributions is that exponential
families arise naturally when we look for a maximum entropy distribution consistent with given
constraints on expected values. For example, for a non-negative random variable with an expected
value of $1/\lambda$, the maximum entropy distribution is the exponential distribution, and for a
real-valued random variable with a known mean and variance, the maximum entropy distribution is
the normal distribution. We now define the maximum entropy solution a bit more formally. The
entropy of a random variable X is defined as:

$$H(X) = -\sum_{x} p(x) \log p(x)$$

When we find a probability distribution $p^*(x)$ that maximizes H(X) while satisfying the
following constraints on the expected values,

$$\sum_{x} p(x) f_i(x) = a_i, \qquad \sum_{x} p(x) = 1,$$

we get a unique $p^*(x)$ that belongs to an exponential family of distributions [54], where the
exponential family is as defined in Equation 3.20.
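The relationship in Equation 3.24 from the third property can be checked numerically for the Poisson family, where the log-partition function is the exponential of the natural parameter and the sufficient statistic is x itself. The Python sketch below is only an illustration; the finite-difference step, the number of summed terms, and the parameter value are assumptions.

```python
import math

def psi(theta):
    """Log-partition function of the Poisson family."""
    return math.exp(theta)

def poisson_mean(mu, terms=100):
    """E[x] computed directly by summing x * P(x) over the first `terms` values."""
    total, p = 0.0, math.exp(-mu)  # p starts at P(x = 0)
    for x in range(terms):
        total += x * p
        p *= mu / (x + 1)          # P(x + 1) = P(x) * mu / (x + 1)
    return total

mu = 3.5
theta = math.log(mu)
h = 1e-6
grad_psi = (psi(theta + h) - psi(theta - h)) / (2 * h)  # numerical derivative of psi
expected_x = poisson_mean(mu)
```

Both `grad_psi` and `expected_x` should be close to the mean parameter 3.5.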


3.3.1  Dirichlet  Distribution 

The Dirichlet distribution is a member of the exponential family of distributions. It is a
multivariate distribution over positive real numbers. Let $\alpha_1, \ldots, \alpha_K$ be the
parameters of a Dirichlet distribution such that $\alpha_i > 0$ for $i = 1, \ldots, K$. A vector
$(x_1, \ldots, x_K)$, where $x_i > 0$ for $i = 1, \ldots, K$ and $\sum_{i=1}^{K} x_i = 1$, is
distributed according to the Dirichlet distribution, denoted as
$(x_1, \ldots, x_K) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$. Samples taken from a
K-dimensional Dirichlet distribution lie on a $(K-1)$-simplex. The probability density of the
vector $x_1, \ldots, x_K$ under the Dirichlet distribution can be written as

$$P(x_1, \ldots, x_K) = \mathrm{Dir}(x_1, \ldots, x_K \mid \alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1} \tag{3.26}$$

$$x_i > 0, \qquad \sum_{i} x_i = 1; \qquad \alpha_i > 0 \tag{3.27}$$


where $\alpha_i$ is the ith element of the parameter vector $\alpha$. The normalization constant
$B(\alpha)$ is defined as follows:

$$B(\alpha) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\left( \sum_{i=1}^{K} \alpha_i \right)} \tag{3.28}$$


The $\Gamma$ function is the gamma function. It can be thought of as the factorial function
extended to real numbers. The gamma function is defined as follows:

$$\Gamma(x) = \int_{0}^{\infty} t^{x-1} e^{-t}\,dt \tag{3.29}$$

[Figure 3.4: contours of the Dirichlet distribution; surviving panel title: alpha = [1 1 5]]


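The expected value and the variance of the Dirichlet distribution are given as follows:

$$E[x_i] = \frac{\alpha_i}{\alpha_0} \tag{3.30}$$

$$\mathrm{Var}[x_i] = \frac{\alpha_i (\alpha_0 - \alpha_i)}{\alpha_0^2 (\alpha_0 + 1)} = \frac{E[x_i]\,(1 - E[x_i])}{\alpha_0 + 1} \tag{3.31}$$

where $\alpha_0 = \sum_{i} \alpha_i$. From this, we can see that, for a fixed mean, the variance of
the Dirichlet distribution is inversely proportional to $\alpha_0 + 1$, so it shrinks as
$\alpha_0$ grows. Figure 3.4 shows the contours of the Dirichlet distribution for different values
of $\alpha$.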
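The effect of the total concentration on the variance can be seen in a short Python sketch of Equations 3.30 and 3.31. The parameter vector and the scaling factor of 10 are arbitrary choices made for illustration.

```python
def dirichlet_mean(alpha):
    """E[x_i] = alpha_i / alpha_0."""
    a0 = sum(alpha)
    return [a / a0 for a in alpha]

def dirichlet_variance(alpha):
    """Var[x_i] = alpha_i (alpha_0 - alpha_i) / (alpha_0^2 (alpha_0 + 1))."""
    a0 = sum(alpha)
    return [a * (a0 - a) / (a0 ** 2 * (a0 + 1)) for a in alpha]

alpha = [1.0, 1.0, 5.0]
scaled = [10 * a for a in alpha]  # same mean, ten times the total concentration
mean, var = dirichlet_mean(alpha), dirichlet_variance(alpha)
mean_scaled, var_scaled = dirichlet_mean(scaled), dirichlet_variance(scaled)
```

Scaling every parameter by the same factor leaves the mean unchanged but reduces the variance of every component, so the density concentrates around the mean.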


Relation  Between  Multinomial  and  Dirichlet  Distribution 

The Dirichlet distribution is among the most important distributions because it is the conjugate
prior for the parameters $\beta_1, \ldots, \beta_K$ of a discrete multinomial distribution [55].

This property can be used to find the posterior over the parameters of a multinomial distribution
given a prior and observed data. Specifically, if the prior over the parameters of a multinomial
distribution is a Dirichlet distribution with parameter $\alpha$, then in the presence of observed
data $x_1, \ldots, x_K$, we get the following relationships:

$$\beta \sim \mathrm{Dir}(\alpha)$$

$$x_1, \ldots, x_K \mid \beta \sim \mathrm{Multi}(\beta)$$

$$\beta \mid x \sim \mathrm{Dir}(\alpha + x)$$

where x is the vector of observed values, $x_1, \ldots, x_K$, and $\beta$ is the parameter vector of
the multinomial distribution [56].
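Because of conjugacy, the posterior update amounts to adding the observed counts to the prior parameters. The following Python sketch illustrates this with a hypothetical symmetric prior and hypothetical word counts.

```python
def dirichlet_posterior(alpha, counts):
    """Dir(alpha) prior plus multinomial counts x gives a Dir(alpha + x) posterior."""
    return [a + c for a, c in zip(alpha, counts)]

def dirichlet_mean(alpha):
    total = sum(alpha)
    return [a / total for a in alpha]

prior = [1.0, 1.0, 1.0]  # symmetric prior over a three-word vocabulary
counts = [7, 2, 1]       # observed word counts x
posterior = dirichlet_posterior(prior, counts)
smoothed = dirichlet_mean(posterior)  # posterior mean estimate of beta
```

With this prior, the posterior mean coincides with add-one smoothing of the observed word frequencies.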


Standard  exponential  family  form  of  Dirichlet  Distribution 

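It is easier to identify the natural and expectation parameters of the Dirichlet distribution by
writing it in its standard form. The Dirichlet distribution can be written in exponential family
form as

$$\mathrm{Dir}(p_1, \ldots, p_K \mid u) = \exp\left( \begin{pmatrix} u_1 - 1 \\ \vdots \\ u_K - 1 \end{pmatrix}^{T} \begin{pmatrix} \log p_1 \\ \vdots \\ \log p_K \end{pmatrix} + \ln \Gamma(U) - \sum_{i=1}^{K} \ln \Gamma(u_i) \right) \tag{3.32}$$

where $U = \sum_{i=1}^{K} u_i$. The expectation under P of the natural statistic vector is

$$E_P \begin{pmatrix} \log p_1 \\ \vdots \\ \log p_K \end{pmatrix} = \begin{pmatrix} \psi(u_1) - \psi(U) \\ \vdots \\ \psi(u_K) - \psi(U) \end{pmatrix} \tag{3.33}$$

where $\psi(\cdot)$ here denotes the digamma function, as defined previously.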


Relation  Between  Gamma  and  Dirichlet  Distribution 

Dirichlet distributions are also related to other members of the exponential family of
distributions, such as the Beta and Gamma distributions [57]. The Gamma distribution can be used
to generate samples from a given Dirichlet distribution. The Gamma distribution has two
parameters, k and $\theta$, which are known as the shape and scale parameters, respectively. Its
probability density function is defined as follows:

$$f(x; k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\theta^k\,\Gamma(k)} \qquad \text{for } x > 0 \text{ and } k, \theta > 0 \tag{3.34}$$
We obtain the following relationship between the Dirichlet and Gamma distributions. If the
$y_i$'s are independently distributed according to Gamma distributions with parameters $\alpha_i$
and $\theta$ for $i = 1, \ldots, k$, that is, $y_i \sim \mathrm{Gamma}(\alpha_i, \theta)$ independently,
then

$$V = \sum_{i=1}^{k} y_i \sim \mathrm{Gamma}(\alpha_0, \theta) \qquad \text{where } \alpha_0 = \sum_{i=1}^{k} \alpha_i \tag{3.35}$$

and

$$X = (x_1, \ldots, x_k) = \left( \frac{y_1}{V}, \ldots, \frac{y_k}{V} \right) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_k) \tag{3.36}$$

This property of the Gamma and Dirichlet distributions is often used to draw samples from a
given Dirichlet distribution.
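A sampler based on this relationship can be sketched in a few lines of Python using the Gamma generator from the standard library. The parameter vector, the seed, and the number of samples are arbitrary choices for illustration.

```python
import random

def sample_dirichlet(alpha, rng, scale=1.0):
    """Normalize independent Gamma(alpha_i, scale) draws (Equations 3.35 and 3.36)."""
    y = [rng.gammavariate(a, scale) for a in alpha]
    v = sum(y)
    return [yi / v for yi in y]

rng = random.Random(42)
alpha = [1.0, 1.0, 5.0]
samples = [sample_dirichlet(alpha, rng) for _ in range(20000)]
empirical_mean = [
    sum(s[i] for s in samples) / len(samples) for i in range(len(alpha))
]
```

The empirical mean of the samples should be close to the theoretical mean of the Dirichlet distribution, which is each parameter divided by the sum of all parameters. The common scale parameter cancels in the normalization, so any positive value can be used.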

3.4  Bregman  Divergences 

As mentioned in the previous section, some common families of probability distributions, such as
the Gaussian, Binomial, and Poisson, are exponential families. This formalism has turned out to
be very powerful in statistics and machine learning. The framework is general enough to include
many distributions of interest (such as the distributions that factor over a specified undirected
graph) while being specific enough to imply a number of special properties. Furthermore, each
exponential family has a natural distance measure associated with it. In the case of spherical
Gaussians, this distance measure is the squared Euclidean distance, because the density at any
given point is determined by its squared Euclidean distance from the mean.

As  another  example,  in  the  multinomial  distribution,  it  can  be  checked  that  the  density  of  a  point 
depends  on  its  KL-divergence  from  the  mean.  Therefore,  KL  divergence  is  the  natural  distance 
measure  of  the  multinomial.  Notice  that  it  is  not  a  metric,  i.e.,  it  is  not  symmetric  and  does  not 
satisfy  the  triangle  inequality.  However,  as  we  will  see,  it  is  well-behaved  in  some  ways  and  has 
a  lot  in  common  with  squared  Euclidean  distance. 

The  various  distance  measures  underlying  different  exponential  families  are  collectively  known 
as  the  Bregman  divergences  [58,  59].  We  give  the  formal  definition  of  these  divergences,  which 
does  not  follow  the  intuition  about  exponential  families  but  rather  associates  each  divergence 
with  a  specific  convex  function. 

Definition: Let $\phi : S \to \mathbb{R}$ be a strictly convex function that is defined on a convex
domain $S \subseteq \mathbb{R}^d$ and is differentiable on the interior of S. The Bregman distance
$D_\phi : S \times \mathrm{int}(S) \to [0, \infty)$ is then defined by

$$D_\phi(x, y) = \phi(x) - \phi(y) - \nabla\phi(y) \cdot (x - y) \tag{3.37}$$

Figure 3.5: Bregman distance between points x and y gives us the first order approximation error
when $\phi(x)$ is estimated using point y.


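A pictorial representation of the Bregman distance between two points, x and y, is shown in
Figure 3.5. As mentioned earlier, some commonly used distances belong to the family of Bregman
distances. For example, choosing $\phi(x) = \frac{1}{2}\lVert x \rVert^2$ gives
$D_\phi(x, y) = \frac{1}{2}\lVert x - y \rVert^2$, which is the squared Euclidean distance. The
derivation of this distance measure is shown below.

$$D_\phi(x, y) = \frac{1}{2}\lVert x \rVert^2 - \frac{1}{2}\lVert y \rVert^2 - y \cdot (x - y) = \frac{1}{2}\lVert x \rVert^2 - x \cdot y + \frac{1}{2}\lVert y \rVert^2 = \frac{1}{2}\lVert x - y \rVert^2 \tag{3.38}$$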
Similarly, $\phi(x) = \sum_{i=1}^{d} x_i \log x_i$ gives

$$D_\phi(x, y) = \sum_{i=1}^{d} \left( x_i \log \frac{x_i}{y_i} - x_i + y_i \right) \tag{3.39}$$

which is a generalization of KL-divergence. It can easily be seen that this generalization reduces
to the regular definition of KL-divergence when x and y are probability measures and therefore
sum to one. The next subsection discusses KL-divergence in a little more detail.
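Both examples can be reproduced from the general definition in Equation 3.37. The Python sketch below implements the Bregman distance for an arbitrary convex function supplied together with its gradient; the function names and the two test vectors are assumptions for illustration.

```python
import math

def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - grad_phi(y) . (x - y)  (Equation 3.37)."""
    return phi(x) - phi(y) - sum(
        g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y)
    )

def half_sq_norm(x):
    return 0.5 * sum(xi * xi for xi in x)

def half_sq_norm_grad(x):
    return list(x)

def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def neg_entropy_grad(x):
    return [math.log(xi) + 1.0 for xi in x]

# Two probability vectors, so the generalized KL-divergence reduces to KL-divergence.
p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]
d_euclid = bregman(half_sq_norm, half_sq_norm_grad, p, q)
d_kl = bregman(neg_entropy, neg_entropy_grad, p, q)
kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Here `d_euclid` equals half the squared Euclidean distance between the two vectors, and `d_kl` agrees with the KL-divergence computed directly.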

3.4.1  KL-Divergence:  A  Closer  Look 

KL-divergence is regarded as the “distance” between two distributions. For two distributions
p(x) and q(x), the KL-divergence between them is defined as:

$$D_{KL}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} \tag{3.40}$$

Notice that the KL-divergence between p and q is not symmetric and does not obey the triangle
inequality. As we saw earlier, KL-divergence is a Bregman divergence, so it is always non-negative
and equals 0 only if p = q. Support for the KL distance as the true distance between two
distributions comes from the concepts of information and coding theory. In coding theory, the
entropy of a random variable X can be thought of as the average number of bits needed to represent
it. $D_{KL}(p \,\|\, q)$ can be thought of as the average number of extra bits that are wasted by
encoding events from p using a code that is optimized for q.
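The coding interpretation can be illustrated in Python by measuring everything in bits (base-2 logarithms). The two distributions below are hypothetical, and the sketch assumes that q is positive wherever p is positive.

```python
import math

def kl_divergence_bits(p, q):
    """D_KL(p || q) in bits; terms with p(x) = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy_bits(p):
    """Average number of bits needed to encode events from p with an optimal code."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Average code length when events from p are encoded with a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.8, 0.1, 0.1]
wasted = kl_divergence_bits(p, q)
overhead = cross_entropy_bits(p, q) - entropy_bits(p)
reverse = kl_divergence_bits(q, p)
```

The value `wasted` matches `overhead`, the difference between the average code length under the mismatched code and the entropy of p, while `reverse` differs from `wasted`, which shows the asymmetry.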

3.5 Bregman Divergences as Natural Distances for Exponential Distributions

In this section, we describe the relationship between Bregman divergences and exponential
families of distributions more formally. We mentioned earlier that each exponential family has a
natural distance measure associated with it and that this distance is a Bregman distance. The
relationship brings out how Bregman divergences are in fact the natural distance measures for
exponential families. In essence, the probability density at any point x under an exponential
family distribution is determined by a Bregman distance of x from the mean, $\mu$: the density
decays exponentially with that distance. In other words, the density of an exponential family
distribution, $P_{(\psi,\theta)}(x)$, can be written as

$$P_{(\psi,\theta)}(x) \propto e^{-D_\phi(x, \mu)} \tag{3.41}$$

where $D_\phi(\cdot, \cdot)$ is a Bregman divergence function. The relationship between the
functions $\psi$ and $\phi$ is discussed later in this section. Figure 3.6 shows the contours of a
few common Bregman distances, which also give the shape of the equal density contours for the
corresponding exponential families.

Figure 3.6: Distance contours of some common Bregman distances that also denote the shape of equal
density contours for the corresponding exponential family.

We will now give a formal description of this relationship. A detailed analysis of this topic is given in the “Clustering with Bregman Divergences” paper [59]. There is a bijection between exponential families and Bregman distances.


3.5.1  Bijection  Relation 

The bijection between exponential families and Bregman divergences is given as follows. The bijection theorem states that any exponential family distribution can also be expressed in the following form, where $\mu$ is the expectation parameter, $\phi$ is a strictly convex function, and $D_{\phi}(\cdot, \cdot)$ is the Bregman distance function:

$$p_{(\psi,\theta)}(x) = b_{\phi}(x)\, e^{-D_{\phi}(x, \mu)} \qquad (3.42)$$

$b_{\phi}$ is defined as

$$b_{\phi}(x) = \exp(\phi(x) - A(x)) \qquad (3.43)$$

Every convex function has an associated convex function, known as its Legendre conjugate or simply conjugate [60]. The convex conjugate, $f^{c}$, of a convex function, $f$, is defined as

$$f^{c}(x) = \sup_{y} \{ x y - f(y) \} \qquad (3.44)$$

This bijection between exponential distributions and the Bregman distance $D_{\phi}(\cdot, \mu)$ is even more profound since the convex functions $\phi$ and $\psi$ are the Legendre conjugates of each other, and $\mu = \nabla\psi(\theta)$, $\theta = \nabla\phi(\mu)$.
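As a quick check of the conjugate definition in Equation 3.44, consider the simplest one-dimensional case. The choice $f(y) = y^2/2$ is only an illustration and is not taken from the original text; it is the function whose Bregman divergence is the squared Euclidean distance.

```latex
% Conjugate of f(y) = y^2 / 2, using f^c(x) = sup_y { x y - f(y) }.
\begin{align*}
  f^{c}(x) &= \sup_{y} \left\{ x y - \tfrac{1}{2} y^{2} \right\} \\
  % The bracket is concave in y; setting its derivative to zero gives
  % x - y = 0, so the supremum is attained at y = x.
           &= x \cdot x - \tfrac{1}{2} x^{2} = \tfrac{1}{2} x^{2}.
\end{align*}
% Here f is its own conjugate, and the gradients f'(y) = y and
% (f^c)'(x) = x are inverse maps of each other, mirroring the pair
% \mu = \nabla\psi(\theta), \theta = \nabla\phi(\mu).
```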


3.5.2  Examples  of  Bregman  Distances  as  Natural  Distances  for  Exponential  Families

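Some examples of exponential family distributions associated with commonly used Bregman distance functions are as follows.

(1) Euclidean distance: $\phi(x) = \frac{\|x\|^{2}}{2}$ gives

$$D_{\phi}(x, \mu) = \frac{\|x - \mu\|^{2}}{2}$$

$$p_{(\psi,\theta)}(x) = \exp\left(-\frac{\|x - \mu\|^{2}}{2}\right) b_{\phi}(x)$$

We get the familiar Gaussian distribution. Thus Euclidean distance is the natural distance associated with the Gaussian distribution.

(2) KL-distance: $\phi(x) = \sum_{i=1}^{d} x_i \log x_i$ gives

$$D_{\phi}(x, y) = \sum_{j=1}^{d} x_j \log \frac{x_j}{y_j}$$

and, taking $y = \mu$,

$$p_{(\psi,\theta)}(x) = \prod_{j=1}^{d} \left(\frac{\mu_j}{x_j}\right)^{x_j} b_{\phi}(x)$$

Thus KL distance is shown to be the natural distance associated with the multinomial distribution.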
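The two examples above can be checked numerically. The sketch below assumes the standard definition of a Bregman divergence, $D_{\phi}(x, y) = \phi(x) - \phi(y) - \langle x - y, \nabla\phi(y) \rangle$, and evaluates it for the two generating functions. The helper names and the test vectors are illustrative, and the vectors are chosen to sum to one so that the second divergence reduces exactly to the KL form.

```python
import math


def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>."""
    g = grad_phi(y)
    inner = sum((xi - yi) * gi for xi, yi, gi in zip(x, y, g))
    return phi(x) - phi(y) - inner


# (1) phi(x) = ||x||^2 / 2 generates half the squared Euclidean distance.
def half_sq_norm(x):
    return 0.5 * sum(v * v for v in x)


def grad_half_sq_norm(x):
    return list(x)


# (2) phi(x) = sum_i x_i log x_i generates the KL distance.
def neg_entropy(x):
    return sum(v * math.log(v) for v in x)


def grad_neg_entropy(x):
    return [math.log(v) + 1.0 for v in x]


# Illustrative probability vectors (both sum to 1).
x = [0.2, 0.3, 0.5]
y = [0.4, 0.4, 0.2]

d_euclid = bregman(half_sq_norm, grad_half_sq_norm, x, y)
d_kl = bregman(neg_entropy, grad_neg_entropy, x, y)

# Closed forms from the text, for comparison.
expected_euclid = 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))
expected_kl = sum(a * math.log(a / b) for a, b in zip(x, y))
```

Up to floating-point error, `d_euclid` agrees with `expected_euclid` and `d_kl` with `expected_kl`.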


3.6  Conclusion 

This chapter goes over important background concepts that are utilized in our research. Some of these concepts, such as SVMs, KNN classifiers, the EM algorithm, and topic models such as LDA and PLSA, are well known. Others, such as Bregman divergences and their relationship to the exponential family of distributions, are less widely known. In addition to laying out and introducing some of the important concepts of machine learning, this chapter also tries to weave these concepts together by showing the relationships among them. We will refer back to this chapter while describing some of our own algorithms that are developed as part of this dissertation.

CHAPTER 4:

Datasets for Cross Domain Document Classification


Since cross-domain document classification is a relatively new area of research, it lacks a common dataset that researchers can use for their experiments. The field needs a dataset that has a large number of documents on various topics. Moreover, these topics need to be common across multiple domains. For example, a dataset will require hundreds or thousands of documents, in each domain, on topics such as “finance” and “music.” Therefore, we developed our own datasets for our experiments to specifically show the robustness and accuracy of our algorithms for cross-domain classification. We obtained our text documents from three different domains: (1) Wikipedia, (2) New York Times (NYT), and (3) the 20-Newsgroups dataset. All three of these domains contain a large number of documents on a variety of topics. We extracted different sets of data, each consisting of its own group of topics. These topics were carefully chosen to reflect some real-world document categories and to ensure that the topic exists in all three domains. These datasets can also serve as a valuable resource for researchers working in the field of cross-domain document classification. In this chapter, we go over the details of the datasets that were created for our research.

4.1  Wikipedia 

Wikipedia provides a massive data source for research in the area of document classification. With over 16 GB of text data, over 8 million titles, and over 20 million page-category links [14] [61], extracting useful information from Wikipedia requires carefully developed data structures and optimized algorithms. To download Wikipedia articles, one could write a computer program to download the HTML pages directly from the Web and then use a package such as Beautiful Soup [62] to strip off all the HTML tags and obtain the plain text. This type of approach works well if one only needs to download a few articles or if the titles of the articles are known in advance. However, for a task that requires building an entire training set from thousands of articles under different categories, downloading individual pages using a Web API would have been too slow.

Our dataset was prepared from the original text files of the articles, written in Wiki script. Wikipedia provides the text of all the articles at the following URL: download.wikimedia.org/enwiki/latest/ in a zipped XML file. This link also provides various other files, such as files containing all the previous versions of every article. Our Wikipedia data was obtained by downloading the XML file containing the current version of all Wikipedia articles, written in Wiki script, on Oct 30, 2009. The XML file is large (20 GB) and has a very simple structure, containing only two tags, one for the title of the article and a second for the text of the article. It does not contain any explicit labeling or categorization of the articles. All other information about an article, such as its categories, had to be inferred from the article text itself.

Each Wikipedia article can be assigned to multiple categories. Categories themselves can be put into other categories, thus creating a “hierarchy” tree of categories. In fact, what is produced by Wikipedia is not a tree but a graph with many cycles. An article is assigned to a category by putting a [[Category: ]] tag in the article. For example, the tag [[Category:History]] in page x will place page x under the category “History.” To see a list of all the pages under a category, e.g., “History,” in Wikipedia, one can look at the page titled “Category:History.” However, the contents of these listing pages are dynamically generated from the category tags in the various pages. This is similar to how the references at the end of a page are dynamically generated from the “ref” tags in the body of the article. We first collect the category-title information by scanning all the articles and collecting their “[[Category: ]]” tags. We can then build a graph of the category hierarchy and easily access all the articles under a given category.
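The tag-collection step described above can be sketched as follows. This is a minimal illustration and not the indexing code used to build the dataset. It assumes the pages have already been read from the XML dump as (title, text) pairs and that category pages carry titles starting with “Category:”. The regular expression, the function name, and the sample pages are all hypothetical.

```python
import re
from collections import defaultdict

# Matches [[Category:Name]] and [[Category:Name|sort key]], capturing Name.
CATEGORY_TAG = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def build_category_index(pages):
    """Scan (title, wiki_text) pairs and return two mappings:

    articles_in[category] -> set of article titles tagged with it
    subcats_of[category]  -> set of sub-category names tagged with it
    """
    articles_in = defaultdict(set)
    subcats_of = defaultdict(set)
    for title, text in pages:
        for match in CATEGORY_TAG.finditer(text):
            parent = match.group(1)
            if title.startswith("Category:"):
                # A category page tagged with another category is a sub-category.
                subcats_of[parent].add(title[len("Category:"):])
            else:
                articles_in[parent].add(title)
    return articles_in, subcats_of


# Hypothetical sample pages, for illustration only.
sample_pages = [
    ("Roman Empire", "Text ... [[Category:History]] [[Category: Ancient Rome|Rome]]"),
    ("Category:Ancient Rome", "[[Category:History]]"),
]
articles_in, subcats_of = build_category_index(sample_pages)
```

Keeping articles and sub-categories in separate mappings makes the later graph traversal straightforward, since the traversal only needs to follow `subcats_of` edges.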

To be able to extract articles from a given category, we implemented some intermediate paging and indexing files for faster search through the large XML file. For the text mining task, we need to obtain a set of articles under a specific category and use the category as the document class label. Since categories can have sub-categories, the pages under a given category may not appear directly under the category but may appear in one of its sub-categories. To collect all the articles under a given category, one can traverse the graph down from a category node to its sub-category nodes and collect all the articles along the way. Since the category hierarchy in Wikipedia is not strictly a tree structure, it can have loops, so there is a risk of ending up in an infinite loop while going from a parent node to its children nodes. Figure 4.1 shows the hierarchical structure of the categories with the category “History” as the root node [63]. The problem can be avoided by simply marking the articles as they are collected, so that no article is collected twice, and stopping when all the articles under a given node have been collected. In practice, however, as we go down the category hierarchy, soon after depth one we start to diverge far from the original topic and start coming across cycles in the hierarchical structure. To avoid straying too far from the original category, we go down to a depth of one while collecting articles under a given category. So, to get a list of all the articles under the category “History,” we look at all the articles under that category and its sub-categories. We do not go deeper into the categories, as a depth of one gives us most of the relevant articles under a category. Even though, in theory, one could go to a depth of 3 or 4 in the category hierarchy, the relevance of the articles to the original root category greatly decreases after depth 1 or 2. For example, the path Law → Inheritance → Caste → Kaji (Nepal) shows how the later categories diverge from the topic of the root category, law.

The Wikipedia category network has over 8000 categories. These categories range from very specific topics, such as “TV cartoon characters of 1980s,” to very broad ones, such as “Science.” We selected the categories on three criteria: (1) the categories were not too general or too specific and contained at least 200 articles, (2) the categories represented real-world topics, and (3) the categories had a large number of articles in at least one other domain.
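The depth-limited collection described above can be sketched as a breadth-first traversal. This is an illustration under stated assumptions rather than the code used for the dataset: the category graph is assumed to be available as two dictionaries (articles per category and sub-categories per category), and the sketch additionally keeps a set of visited categories, which is one simple way to guarantee termination when the graph has cycles. The article titles in the example are made up; only the category path comes from the text.

```python
from collections import deque


def collect_articles(root, articles_in, subcats_of, max_depth=1):
    """Collect article titles under `root`, descending at most `max_depth`
    levels into sub-categories. Breadth-first order ensures every category
    is first reached at its smallest depth."""
    collected = set()   # articles gathered so far (no duplicates)
    visited = set()     # categories already expanded (guards against cycles)
    frontier = deque([(root, 0)])
    while frontier:
        category, depth = frontier.popleft()
        if category in visited:
            continue
        visited.add(category)
        collected.update(articles_in.get(category, ()))
        if depth < max_depth:
            for child in subcats_of.get(category, ()):
                frontier.append((child, depth + 1))
    return collected


# Toy graph with a cycle (Inheritance -> Law); article titles are hypothetical.
subcats_of = {
    "Law": {"Inheritance"},
    "Inheritance": {"Caste", "Law"},
    "Caste": {"Kaji (Nepal)"},
}
articles_in = {
    "Law": {"Statute"},
    "Inheritance": {"Will"},
    "Caste": {"Varna"},
}

depth_one = collect_articles("Law", articles_in, subcats_of, max_depth=1)
depth_two = collect_articles("Law", articles_in, subcats_of, max_depth=2)
```

With `max_depth=1` only the root category and its direct sub-categories contribute articles, which corresponds to the depth-one choice made in this work; raising the limit pulls in increasingly off-topic categories such as “Caste.”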


4.2  New  York  Times  (NYT) 

The NYT dataset [64] is a corpus containing nearly all the articles from the New York Times from January 1987 to June 19, 2007. The corpus contains over 1.8 million articles spanning a period of over 20 years. This amounts to an average of 250 articles per day, with a maximum of 955 articles in a day. There are very few days in this corpus that have zero articles. The NYT corpus is very well organized and professionally created by the NYT staff. Each article is contained in its own XML file. These files are organized into folders according to date, with year, month and day making up different sub-folder levels. Each XML file contains on average 50 tag values that list various attributes of the article, such as “date published,” “author,” “title,” and “lead paragraph.” The main part of the article is written under the tag “body.” The “body” tag itself is composed of “body.head” and “body.content” sub-tags. To prepare the dataset from NYT articles, we are mostly interested in a tag titled “category” or “topic” that we could use to generate a labeled dataset. The corpus, however, does not contain a single tag that can be used to describe the category or topic of the article. After analyzing all the XML tags that could serve as a category, we made a list of seven potential candidates for the category tag, shown in Table 4.1.

Figure 4.1: The hierarchical network of categories in Wikipedia with the category “History” as the root node. The category structure is not strictly a tree and can have loops.

“M” and “S” denote whether the field can have multiple values or a single value. For example, the entry in the first line of Table 4.1 has a value of “M,” which means that in an article on “Tom Brokaw,” the tag “Biographical Categories” can have multiple values such as “journalism” and “television.” Out of these seven possible choices for the category tag, we chose four different fields as labels for the document, namely, (1) Descriptors, (2) General Online Descriptors, (3) Online Descriptors, and (4) Taxonomic Classifiers. An article is assigned all the labels that occur in at least one of these chosen fields. Therefore, to gather the articles that belong to the category “opera,” all the articles that have “opera” in at least one of these tags are gathered. Table 4.2 shows all the categories in the different groups as they are named in Wikipedia and the New York Times.
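The labeling rule above (an article receives every label that appears in at least one of the four chosen fields) can be sketched as a set union. The sketch assumes each article has already been parsed from its XML file into a dictionary that maps a field name to a list of values; the dictionary keys, the exact case-insensitive matching, and the sample articles are illustrative assumptions, not details of the corpus format.

```python
# Hypothetical keys standing in for the four chosen NYT fields.
LABEL_FIELDS = (
    "descriptors",
    "general_online_descriptors",
    "online_descriptors",
    "taxonomic_classifiers",
)


def article_labels(article):
    """Union of the values found in the four label fields (lower-cased)."""
    labels = set()
    for field in LABEL_FIELDS:
        for value in article.get(field, ()):
            labels.add(value.strip().lower())
    return labels


def articles_with_label(articles, label):
    """Return the articles that carry `label` in at least one chosen field."""
    wanted = label.strip().lower()
    return [a for a in articles if wanted in article_labels(a)]


# Made-up articles, for illustration only.
sample_articles = [
    {"title": "A", "descriptors": ["Opera", "Music"]},
    {"title": "B", "online_descriptors": ["Motion Pictures"]},
    {"title": "C", "general_online_descriptors": ["opera"], "feature_page": ["Opera"]},
    {"title": "D", "feature_page": ["Opera"]},
]
opera_articles = articles_with_label(sample_articles, "opera")
opera_titles = [a["title"] for a in opera_articles]
```

Article “D” is not selected because “Feature Page” is not one of the four fields used for labels.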

The following five pairs of tables, Tables 4.3 to 4.12, outline the content of each dataset created from the five groups, namely, (1) Arts, (2) Computers, (3) Current Affairs, (4) Science, and (5) Social Sciences. Each group has two tables: the first table presents the counts of articles from each of the categories in that group, while the second table provides technical information about the dataset, such as the vocabulary size and the average size of the documents. These datasets were collected from two domains, Wikipedia and New York Times (NYT), and the tables give information pertaining to both of these domains.

Values  Tag Name (example)                                     Description
------  -----------------------------------------------------  ------------------------------------------------
M       Biographical Categories (ex: Books and Magazines)      category to which the featured individual belongs
M       Descriptors (ex: Data Processing)                      descriptive terms corresponding to subject
S       Feature Page (ex: Education)                           name of page article appears in
M       General Online Descriptors (ex: Research, Surfing)     general description
M       Online Descriptors (ex: Computers And the Internet)    descriptive terms corresponding to topics
S       Online Section (ex: Business; Technology)              name of the section placed under nytimes.com
M       Taxonomic Classifiers (ex: Top/News/Technology)        hierarchy of taxonomic descriptors

Table 4.1: List of seven potential candidates for the category tags in the XML files of the NYT dataset. “M” and “S” denote whether the field can have multiple values or not.

Table 4.3 presents results for the group “Arts,” which includes categories such as “Film,” “Literature,” and “Music.” Table 4.3 shows that the number of articles obtained from Wikipedia under the category “Film” is 997 and from NYT under the same category is 1618. Table 4.4 gives some more technical details of this dataset. The combined vocabulary size, with only the stop words removed, is 118,778. This is a large vocabulary and can be reduced. We reduce it by removing the rare words from the dataset and the words that appear in only one of the two domains. By removing the words that occur fewer than six times in the entire dataset, we reduce the vocabulary size by almost 50%, to 61,688, as shown by the second entry in Table 4.4. We show in our experiments that we do not get any significant decrease in accuracy by removing all the rare words that occur fewer than 100 times in the entire dataset. This reduces the dimension by roughly 90% relative to the full vocabulary, to 10,373, as listed in the third line of Table 4.4. Since we do not lose any classification accuracy by removing words that occur fewer than 100 times, we (unless otherwise stated) use this dataset for our experiments. Table 4.4 goes on to show other attributes of the datasets, such as “Minimum number of words in a document” or words that appear fewer than two times in one of the domains (Words in domain less than two times). Tables 4.5 to 4.12 give the same details for the other groups of categories.
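The vocabulary reduction described above (drop words that are rare in the combined dataset, and drop words that occur in only one of the two domains) can be sketched as follows. It assumes each domain is given as a list of tokenized documents with stop words already removed; the function name, the `min_count` parameter, and the toy documents are illustrative and not taken from the original implementation.

```python
from collections import Counter


def shared_vocabulary(docs_a, docs_b, min_count=100):
    """Return the words kept after pruning.

    A word is kept if it occurs at least `min_count` times in the two
    domains combined and appears at least once in each domain.
    """
    counts_a = Counter(word for doc in docs_a for word in doc)
    counts_b = Counter(word for doc in docs_b for word in doc)
    total = counts_a + counts_b
    return {
        word
        for word, count in total.items()
        if count >= min_count and word in counts_a and word in counts_b
    }


# Toy documents; a small threshold stands in for the 6 and 100 used in the text.
wiki_docs = [["film", "film", "opera"], ["film", "rare_a"]]
nyt_docs = [["film", "opera", "opera"], ["rare_b"]]

vocab_3 = shared_vocabulary(wiki_docs, nyt_docs, min_count=3)
vocab_4 = shared_vocabulary(wiki_docs, nyt_docs, min_count=4)
```

Raising the threshold shrinks the vocabulary, in the same way that moving from a threshold of 6 to a threshold of 100 shrinks the “Arts” vocabulary in Table 4.4.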

Table 4.2: Groups of Categories

Group Name           Categories          Category Name in NYT
-------------------  ------------------  --------------------------------------
Arts (6)             Theatre             Theater
                     Music               Music
                     Opera               Opera
                     Film                Motion Pictures
                     Television          Television
                     Literature          Books and Literature
Current Affairs (6)  Finance             Finances
                     Military            Armament, Defense and Military Forces
                     Terrorism           Terrorism
                     Law                 Law and Legislation
                     Christianity        Christians and Christianity
                     Islam               Islam
Science (6)          Genetics            Genetics and Heredity
                     Space               Space
                     Anthropology        Archaeology and Anthropology
                     Medicine            Medicine and Health
                     Chemistry           Chemistry
                     Physics             Physics
Technology (5)       Software            Computer Software
                     Electronics         Electronics
                     Internet            Computers and the Internet
                     Telecommunications  Telephones and Telecommunications
                     Computer Security   Computer Security
Social (5)           Sociology           Sociology
                     Linguistics         Language and Languages
                     Economics           Economic Conditions and Trends
                     Psychology          Psychology and Psychologists
                     Law                 Law and Legislation

Category      Wiki   NYT
------------  -----  -----
Film            997   1618
Literature     1747   1798
Music           555   1618
Opera          3187   1780
Television      874   1702
Theatre         497   1771

Table 4.3: Number of documents in each category under “Arts” group as obtained from two different domains, Wikipedia and NYT

                                                            Wiki     NYT
----------------------------------------------------------  -------  -------
Combined Total Vocabulary Size                               118778   118778
Common Vocabulary Size (words occurring at least 6 times)     61688    61688
Common Vocabulary Size (words occurring at least 100 times)   10373    10373
Other attributes of data after removing rare (less than 100 occurrences) and unique words
Minimum number of words in document                               8        6
Average number of words in document                             401      354
Maximum number of words in document                           10851     4762
Documents with less than 50 words                               726      752
Documents with less than 20 words                                99      209
Words in domain less than 2 times                                 6       34

Table 4.4: Some useful attributes of the dataset for the “Arts” group, including the feature vector dimension (size of the vocabulary) and document sizes.


Category            Wiki   NYT
------------------  -----  -----
Computer Security    1069    944
Electronics          1227    866
Internet              864   1495
Software              924   1471
Telecommunications   1903   1624

Table 4.5: Number of documents in each category under “Computers” group as obtained from two different domains, Wikipedia and NYT

                                                            Wiki     NYT
----------------------------------------------------------  -------  -------
Combined Total Vocabulary Size                                74733    74733
Common Vocabulary Size (words occurring at least 6 times)     38011    38011
Common Vocabulary Size (words occurring at least 100 times)    6970     6970
Other attributes of data after removing rare (less than 100 occurrences) and unique words
Minimum number of words in document                               9        6
Average number of words in document                             410      362
Maximum number of words in document                            7596     3894
Documents with less than 50 words                               555      451
Documents with less than 20 words                                43       75
Words in domain less than 2 times                                 3       84

Table 4.6: Some useful attributes of the dataset for the “Computers” group, including the feature vector dimension (size of the vocabulary) and document sizes.

Category      Wiki   NYT
------------  -----  -----
Christianity    900   1832
Finance         561   1645
Islam          1241   1655
Law            1370   1554
Military        636   1144
Terrorism       432   1113

Table 4.7: Number of documents in each category under “Current Affairs” group as obtained from two different domains, Wikipedia and NYT

                                                            Wiki     NYT
----------------------------------------------------------  -------  -------
Combined Total Vocabulary Size                                93997    93997
Common Vocabulary Size (words occurring at least 6 times)     47805    47805
Common Vocabulary Size (words occurring at least 100 times)    9129     9129
Other attributes of data after removing rare (less than 100 occurrences) and unique words
Minimum number of words in document                              10        7
Average number of words in document                             588      394
Maximum number of words in document                            9651     4430
Documents with less than 50 words                               457      668
Documents with less than 20 words                                56       59
Words in domain less than 2 times                                 6       47

Table 4.8: Some useful attributes of the dataset for the “Current Affairs” group, including the feature vector dimension (size of the vocabulary) and document sizes.

Category      Wiki   NYT
------------  -----  -----
Anthropology   1533   1437
Chemistry      2845   1384
Genetics       2089   1753
Medicine       2288   1925
Physics        2719    581
Space           473   1797

Table 4.9: Number of documents in each category under “Science” group as obtained from two different domains, Wikipedia and NYT

                                                            Wiki     NYT
----------------------------------------------------------  -------  -------
Combined Total Vocabulary Size                               119734   119734
Common Vocabulary Size (words occurring at least 6 times)     60981    60981
Common Vocabulary Size (words occurring at least 100 times)   11260    11260
Other attributes of data after removing rare (less than 100 occurrences) and unique words
Minimum number of words in document                               7        7
Average number of words in document                             449      386
Maximum number of words in document                           10325     5617
Documents with less than 50 words                              1325      723
Documents with less than 20 words                               200      101
Words in domain less than 2 times                                 8      135

Table 4.10: Some useful attributes of the dataset for the “Science” group, including the feature vector dimension (size of the vocabulary) and document sizes.

Category    Wiki   NYT
----------  -----  -----
Economics    1004   1944
Language     1901   1962
Law          1331   1946
Psychology   1173    488
Sociology    1985    420

Table 4.11: Number of documents in each category under “Social Sciences” group as obtained from two different domains, Wikipedia and NYT

                                                            Wiki     NYT
----------------------------------------------------------  -------  -------
Combined Total Vocabulary Size                               102537   102537
Common Vocabulary Size (words occurring at least 6 times)     50839    50839
Common Vocabulary Size (words occurring at least 100 times)    9392     9392
Other attributes of data after removing rare (less than 100 occurrences) and unique words
Minimum number of words in document                               6       12
Average number of words in document                             534      416
Maximum number of words in document                           13263     4756
Documents with less than 50 words                               567      372
Documents with less than 20 words                                54       40
Words in domain less than 2 times                                 5       77

Table 4.12: Some useful attributes of the dataset for the “Social Sciences” group, including the feature vector dimension (size of the vocabulary) and document sizes.


4.3  Newsgroups  Dataset 

The dataset containing text from 20 different newsgroups has been one of the most widely used datasets in the field of machine learning, specifically for text document classification. It is a relatively old dataset, with one of its earliest uses dating back to 1995 [22]. Table 4.3 shows the topics of the 20 newsgroups used in this dataset. The dataset is made available in three different formats, with minor differences among them [65]. The original dataset contains a total of 19,997 posts in 20 different newsgroups, with almost 1000 posts from each newsgroup. The other two versions are created from the original dataset by removing some less important or confusing information, specifically, (1) entries that were copied and posted in multiple newsgroups (duplicate posts) and (2) some extra header information such as path, follow-up, and date. We use one of these derived datasets, known as 20news-18828. It contains, as the name suggests, 18,828 total posts and only the “from” and “subject” fields from the header. This dataset does not contain any duplicate posts, i.e., posts that appear in multiple newsgroups in the original dataset.

comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x

rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey

sci.crypt
sci.electronics
sci.med
sci.space

misc.forsale

talk.politics.misc
talk.politics.guns
talk.politics.mideast

talk.religion.misc
alt.atheism
soc.religion.christian

For our cross-domain document classification task, we had to find similar topics in other domains to create the dataset that can be used in our experiments. Table 4.3 shows the original 20 newsgroup topics. Out of the 20 original topics, we found 12 topics that occurred in our other two domains, Wikipedia and NYT. We prepared four different groups of categories, similar to our experiments with Wikipedia and NYT. These four groups of categories were named “All,” “Politics,” “Rec” (for recreation), and “Sci” (for science). Table 4.13 shows the categories in each one of these groups. The groups “Politics,” “Rec,” and “Sci” are subsets of the group named “All.”

Tables 4.15 and 4.16 present details of the datasets related to the category group “All.” Group “All” includes 12 categories, such as “Automobiles,” “Baseball,” and “Christianity.” Table 4.15 shows the number of articles obtained from Wikipedia, NYT and Newsgroups under the category “Automobiles” as 780, 1922 and 988, respectively. Table 4.16 gives some more technical details of this dataset, in particular the size of the vocabulary of the dataset and some statistics on the document sizes. Each dataset is generated by using two domains at a time. It should be noted that this vocabulary size is the one obtained after removing the words that occurred fewer than 100 times in the dataset and the words that appeared in only one of the two domains. The vocabulary thus changes depending on which two domains are used. Table 4.16 shows that the vocabulary size for the group “All” is 42,410 for the domain pair Wiki-NYT, 34,705 for the domain pair Wiki-Newsgroups, and 32,357 for the domain pair NYT-Newsgroups.

Table 4.13: Groups of Categories

Group Name    Categories    Category in NYT                            Category in Newsgroups
------------  ------------  -----------------------------------------  ----------------------
All (12)      Automobiles   Automobiles                                rec.autos
              Baseball      Baseball                                   rec.sport.baseball
              Christianity  Christians and Christianity                soc.religion.christian
              Cryptography  Computer Security                          sci.crypt
              Electronics   Electronics                                sci.electronics
              Gun           Gun Control                                talk.politics.guns
              Hockey        Hockey, Ice                                rec.sport.hockey
              Medicine      Medicine and Health                        sci.med
              Middle East   Top\News\World\MiddleEast                  talk.politics.mideast
              Motorcycles   Motorcycles, Motor Bikes, Motor Scooters   rec.motorcycles
              Politics      Politics and Government                    talk.politics.misc
              Space         Space                                      sci.space
Politics (3)  Guns          Gun Control                                talk.politics.guns
              MiddleEast    Top\News\World\MiddleEast                  talk.politics.mideast
              Politics      Politics and Government                    talk.politics.misc
Rec (4)       Automobiles   Automobiles                                rec.autos
              Baseball      Baseball                                   rec.sport.baseball
              Hockey        Hockey, Ice                                rec.sport.hockey
              Motorcycles   Motorcycles, Motor Bikes, Motor Scooters   rec.motorcycles
Sci (4)       Cryptography  Computer Security                          sci.crypt
              Electronics   Electronics                                sci.electronics
              Medicine      Medicine and Health                        sci.med
              Space         Space                                      sci.space

Table 4.14: Categories in each one of the groups used in the cross-domain classification experiments using Wikipedia, NYT and Newsgroups as the three domains. Four different groups of categories were used to generate four different datasets. The groups are “All,” “Politics,” “Rec,” and “Sci.”

Category      Wiki   NYT    Newsgroups
------------  -----  -----  ----------
Automobiles     780   1922         988
Baseball        552   1780         994
Christianity    912   1927         996
Cryptography    718   1212         991
Electronics    1454    966         981
Guns             63    957         910
Hockey          139   1814         998
Medicine       2383   1811         988
MiddleEast      849   1508         940
Motorcycles     291    420         992
Politics       1714   1415         775
Space           482   1941         985

Table 4.15: Number of documents in each category under “All” group as obtained from three different domains, Wikipedia, NYT and Newsgroups.


                              Wiki-NYT         Wiki-News        News-NYT
----------------------------  ---------------  ---------------  ---------------
Vocabulary Size               42410            34705            32357
Min. # of words in document   6       8        6       7        7       7
Ave. # of words in document   604.2   416.9    599.5   155.8    155.5   414.2
Max. # of words in document   11985   5252     11891   6553     6591    5162

Table 4.16: Some useful attributes of the dataset for the “All” group, including the feature vector dimension (size of the vocabulary) and size of the documents.


Category    Wiki   NYT    Newsgroups
----------  -----  -----  ----------
Guns           63    966         910
MiddleEast    850   1581         940
Politics     1720   1545         775

Table 4.17: Number of documents in each category under “Politics” group as obtained from three different domains, Wikipedia, NYT and Newsgroups.


Tables  4.17,  4.18,  4.19,  4.20,  4.21,  4.22  list  similar  information  for  the  other  three  groups  of 
categories  used,  namely,  “Politics,”  “Rec”  and  “Sci.” 

4.4  Conclusion 

In this chapter, we summarized the different datasets that we generated for our cross-domain
document classification experiments. Since the field of cross-domain classification is relatively
new, the research community lacks datasets suitable for this research. The cross-domain document
                               Wiki-NYT         Wiki-News        News-NYT
Vocabulary Size                19013            16231            13473
Min. # of words in document    10 / 11          10 / 8           8 / 11
Ave. # of words in document    745.5 / 423.1    737.2 / 207.9    206.5 / 416.2
Max. # of words in document    11840 / 6143     11685 / 4802     4735 / 6048

Table 4.18: Some useful attributes of the dataset for the “Politics” group, including the feature
vector dimension (size of the vocabulary) and the size of the documents.


Category       Wiki    NYT     Newsgroups
Automobiles    784     1974    988
Baseball       552     1810    994
Hockey         139     1815    998
Motorcycles    291     424     992

Table 4.19: Number of documents in each category under the “Rec” group as obtained from three
different domains: Wikipedia, NYT and Newsgroups.


                               Wiki-NYT             Wiki-News            News-NYT
Vocabulary Size                16526                10636                15507
Min. # of words in document    6 / 11               6 / 5                6 / 11
Ave. # of words in document    484.453 / 395.585    468.979 / 111.47     114.238 / 393.737
Max. # of words in document    7981 / 4172          7926 / 6219          6417 / 4130

Table 4.20: Some useful attributes of the dataset for the “Rec” group, including the feature vector
dimension (size of the vocabulary) and the size of the documents.


Category        Wiki    NYT     Newsgroups
Cryptography    719     1217    991
Electronics     1456    971     981
Medicine        2387    1986    988
Space           482     1952    985

Table 4.21: Number of documents in each category under the “Sci” group as obtained from three
different domains: Wikipedia, NYT and Newsgroups.


classification requires datasets that contain labeled documents in one domain over some
pre-determined categories and unlabeled documents over similar categories in a second domain.
With this type of data, one can run experiments by using the first dataset as the training dataset
and the second dataset as the testing dataset. In the datasets created as part of this
dissertation, we use three different domains that all share common categories. This
makes it possible to run the cross-domain experiments among three different pairs of domains,
                               Wiki-NYT             Wiki-News            News-NYT
Vocabulary Size                23426                18533                17100
Min. # of words in document    7 / 7                7 / 7                7 / 7
Ave. # of words in document    480.286 / 401.966    474.368 / 143.685    142.835 / 396.919
Max. # of words in document    8673 / 6173          8554 / 6077          6044 / 6103

Table 4.22: Some useful attributes of the dataset for the “Sci” group, including the feature vector
dimension (size of the vocabulary) and the size of the documents.


and in each pair, each domain can serve as a test or train domain. In addition to this, we
create datasets using four or five different category groups, giving additional sets of data
to test the consistency and robustness of the cross-domain classification algorithms. Our datasets
are extensive and can be used as a benchmark for future research on cross-domain document
classification. We also describe the way we created these datasets, since it may be desirable
to update the datasets from time to time, given the dynamic nature of data sources such as
Wikipedia and the introduction of new domains such as YouTube video descriptions, Twitter, etc.
These datasets and the documentation of the process used to generate them are one of the important
contributions of this dissertation.
CHAPTER 5:
Wikipedia Classification


This chapter demonstrates an effective way of parsing Wikipedia articles for text classification
purposes. The information in this chapter serves two purposes: (1) classification
of Wikipedia articles and (2) classification of articles from a different domain, such as New York
Times (NYT) news. Here we describe methods of data collection for Wikipedia articles and
empirically analyze different sections of the Wikipedia articles to be used in document
classification. We also show the effectiveness of our methods in classifying the given articles
into their categories.

Each Wikipedia article consists of standard sections, such as “introduction,” “references,” “links
to other Wikipedia articles,” etc. In this chapter, we analyze the results of each section
independently and propose a method of combining some of these sections to improve the classification
accuracy. We use multiple datasets from Wikipedia, each containing documents from a number of
different categories. Our results show a consistent improvement in classification accuracy
when different sections of the articles are used independently, instead of using the entire
text of a Wikipedia article all at once. Our results not only show an improvement
in classification, but also show that using only parts of the articles greatly reduces the size of
the training feature vectors. For our experiments, we classify unlabeled Wikipedia and NYT
articles into their respective categories. In both cases, only the labeled Wikipedia articles are
used for training.

Wikipedia is an abundant source of labeled text documents that is freely available and covers
almost all subjects. Wikipedia was launched in 2001 and has been growing exponentially
since then [61]. It now contains over 8 million articles in the English language alone. All of these
articles are assigned to various categories, which can be used as document
labels to prepare a training set for a classification algorithm.

The use of Wikipedia as a supporting dataset to classify documents in a different domain has
received significant attention. However, three areas need improvement: (1) the
availability of benchmark datasets, (2) documentation on how to efficiently gather labeled data
from Wikipedia and (3) guidance on how to use the Wikipedia articles for text classification
purposes. A set of algorithms for effectively exploiting a vast source of labeled documents, such
as Wikipedia, for text document classification is highly desirable.


This chapter describes a way to obtain Wikipedia data for text mining purposes and empirically
analyzes the inherent structure of Wikipedia articles for the text classification task. As our
contribution, we show that our method of parsing and combining the article sections improves
classification accuracy of Wikipedia articles as well as articles from a different domain, such
as the New York Times. For our experiments, we classify unlabeled Wikipedia articles into their
respective categories. We create our dataset by collecting articles under different, but related,
categories. We use 80% of these articles for training and validate on the remaining
20%. We repeat this process for three different groups of five categories to show that our results
generalize. In our experiments, we use three supervised learning methods, namely,
support vector machines (SVMs), k-nearest neighbors (KNN) and naive Bayes (NB).
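
The 80%/20% protocol described above can be sketched as a simple shuffled holdout split. This is only an illustration under stated assumptions: the function name `split_80_20`, the fixed random seed and the toy article list are hypothetical and do not come from the experiments themselves.

```python
import random

def split_80_20(labeled_articles, seed=0):
    """Shuffle (article, label) pairs and split them into 80% train / 20% test."""
    shuffled = list(labeled_articles)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    cut = (len(shuffled) * 4) // 5          # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

# Toy data (hypothetical): ten article identifiers with alternating labels.
articles = [(f"article-{i}", "Finance" if i % 2 else "Law") for i in range(10)]
train, test = split_80_20(articles)
print(len(train), len(test))
```

Repeating the call with different seeds would give several train/test partitions of the same collection.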

This chapter of the thesis is organized as follows. Section 1 outlines the Wikipedia data format
and discusses how a good dataset can be collected from Wikipedia for text mining purposes.
Section 2 gives details of our dataset and the different sections of the Wikipedia articles explored
in our experiments. In Section 3, we show and analyze our classification results using different
sections of the Wikipedia article. Section 4 proposes ways of combining different sections of
the Wikipedia articles to increase the accuracy of the classification. In Section 5, we show
our results for cross-domain classification using NYT as our test domain. In cross-domain
classification, only the Wikipedia articles are used to classify text from a different domain,
namely news stories from NYT. Section 6 outlines the future work and conclusion of this part
of our research.


5.1  Gathering  Data  from  Wikipedia 

We generate 4 groups of categories for our experiments, where each group contains categories
that all belong under the umbrella title of that group. These groups are chosen to make the
classification tasks challenging and closer to real-life classification challenges. Table 5.1 shows
the breakdown of the groups and their categories and the number of articles in each category.
For example, the group named “Arts” includes 6 categories: (1) Theatre, (2) Music, (3)
Opera, (4) Film, (5) Television and (6) Literature. The third column shows the number of
articles in each of these categories; for example, the Theatre category has 497 articles.
Table 5.1: Groups of Categories

Group Name             Categories            Articles
Arts (6)               Theatre               497
                       Music                 554
                       Opera                 3189
                       Film                  989
                       Television            872
                       Literature            1744
Current Affairs (6)    Finance               561
                       Military              637
                       Terrorism             432
                       Law                   1370
                       Christianity          900
                       Islam                 1241
Science (6)            Genetics              2095
                       Space                 481
                       Anthropology          1533
                       Medicine              2292
                       Chemistry             2869
                       Physics               2718
Technology (5)         Software              926
                       Electronics           1228
                       Internet              864
                       Telecommunications    1905
                       Computer Security     1070

5.2  Parsing  the  Wikipedia  Article 

Our goal in this part of the research is to exploit the article structure that is common in
Wikipedia. Most articles in Wikipedia are composed of a set of pre-determined sections; for
example, most articles contain an introduction section at the beginning of the article and a
reference section at the end of the article. We make a list of nine of these sections that are
common across many articles while having some distinctive characteristics. These sections and
their descriptions are listed in Table 5.2. For example, section #2 is named “Main Section” and
contains the main body of the article, in other words, the entire Wikipedia article excluding
sections #4, #7 and #8. Figure 5.1 shows some of the different sections as they occur in a typical
Wikipedia article.


Refs: The text of the references in the external
references section at the end of the article.


outLinks: Text that links to references outside of Wikipedia,
including sections such as “References,”
“External Links,” “See Also,” etc.


Figure 5.1: Different sections of an article from Wikipedia. The sections are shown as they are
used in our experiments. These sections can be found in most Wikipedia articles.

To generate our feature vectors, we parse each article into its different sections. After the
parsing, feature vectors are generated using the raw counts of the words. In the pre-processing
of the data before making the final feature vector, we remove the following: (1) stop words, (2)
any words with digits and (3) words that appear fewer than three times in the entire dataset. In
our experiments, we did not use any feature vector from a document section that had fewer than 10
characters in it. These small feature vectors mostly appeared due to the absence of that section
from the article. For example, if an
article did not have an image, then the feature vector that contained the image section of that


Table 5.2: Description of individual sections used in this chapter

Section #   Name             Description
1           All Sections     All sections of the Wikipedia articles, excluding the names of the article categories at the end
2           Main Section     The entire Wikipedia article excluding sections (4), (7) and (8)
3           Intro            Introduction of the article. Text that comes before the start of any section
4           inLinks          All the words that link to other Wikipedia articles
5           First Words      First 100 words of the article
6           Blue Words       All the words that are linked either within Wikipedia or to an external website
7           Refs             The text in the external references section at the end of the article
8           Image Captions   Words describing the inserted objects (normally images) in the Wikipedia articles
9           outLinks         Text that links to an external website; this includes sections such as “References” and “External Links”

article will have 0 characters (fewer than 10) in it. With the four groups of categories
discussed before and the nine initial sections, we have a total of 36 different datasets. Each
dataset contains articles from the categories listed under one of the four groups, with feature
vectors generated from a particular section of the articles. For example, one dataset would include
feature vectors from only the “inLinks” section of the articles under the group “Arts.” This way,
we obtain nine different training datasets for the category group “Arts” alone, each of which
can be used for classifying the articles.
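
The pre-processing steps described above (stop-word removal, removal of words with digits, a minimum dataset-wide frequency of three, and skipping sections with fewer than 10 characters) can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the stop-word list, the whitespace tokenizer and the names `tokenize`, `keep` and `build_feature_vectors` are assumptions made here.

```python
from collections import Counter

# Tiny illustrative stop-word list (assumption; the actual list is not given in the text).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}

def tokenize(text):
    """Lowercase the text, split on whitespace and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?()[]\"'") for tok in text.lower().split())
    return [tok for tok in tokens if tok]

def keep(token):
    """Filters (1) and (2): drop stop words and words containing digits."""
    return token not in STOP_WORDS and not any(ch.isdigit() for ch in token)

def build_feature_vectors(section_texts, min_count=3, min_chars=10):
    """Return (vocabulary, raw-count vectors) for one section type of a dataset."""
    # Sections with fewer than min_chars characters are treated as absent and skipped.
    docs = [[t for t in tokenize(s) if keep(t)]
            for s in section_texts if len(s) >= min_chars]
    totals = Counter(tok for doc in docs for tok in doc)
    # Filter (3): drop words that appear fewer than min_count times in the whole dataset.
    vocab = sorted(w for w, c in totals.items() if c >= min_count)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

sections = [
    "The bank raised interest rates in 2008.",
    "Interest rates and bank rates.",
    "A bank sets interest rates.",
    "",  # a missing section, e.g. an article without images
]
vocab, vectors = build_feature_vectors(sections)
print(vocab)
print(vectors)
```

In this toy run the empty section is skipped, and words such as "raised" or "sets" are dropped because they occur fewer than three times across the dataset.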

In our experiments, we use all 36 of these datasets and compare the results of each section
individually. The next section discusses our experimental results for the individual sections.
5.3  Individual  Section  Results


Our goal in this section is to evaluate and compare the classification accuracy of different
sections of the Wikipedia articles. We use four different groups to make sure that the results are
consistent and generalizable across documents of different subject matter. Each of the
datasets was classified using three main classification algorithms, namely support vector
machines (SVM), naive Bayes (NB) and k-nearest neighbor (KNN) with cosine similarity as the
“distance” measure. The number of nearest neighbors (k) was 31. The accuracy results did not
vary much with k as we let k vary from 11 to 51; we chose 31 as a suitable value for 6
classes.
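
For concreteness, k-nearest-neighbor classification with cosine similarity over raw-count feature vectors can be sketched as below. This is a sketch under stated assumptions, not the implementation used in the experiments: the function names, the toy vectors and the tie-breaking rule (ties go to the label of the most similar tied neighbor) are hypothetical. The default k=31 follows the text; the toy example uses k=3 because it has only four training vectors.

```python
import math
from collections import Counter

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)

def knn_predict(train_vectors, train_labels, query, k=31):
    """Label the query by majority vote among its k most similar training vectors."""
    scored = sorted(
        ((cosine_similarity(x, query), y) for x, y in zip(train_vectors, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    top_labels = [label for _, label in scored[:k]]
    return Counter(top_labels).most_common(1)[0][0]

train_vectors = [[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 0]]
train_labels = ["Finance", "Finance", "Law", "Law"]
prediction = knn_predict(train_vectors, train_labels, [1, 0, 0], k=3)
print(prediction)
```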

Table 5.3 shows the accuracy, in percent, of the different sections on the “current affairs”
group. The “current affairs” group contains the categories shown in Table 5.1. For the results on
the other category groups, please refer to the appendix. Figure 5.2 plots the size
of the different sections (on the x-axis) against the accuracy in percent. This graph is consistent
across the different groups of categories (see results in the appendix). We see that the additional
data in the entire Wikipedia article (AllSections) does not provide any advantage in accuracy
when compared with the data in only the first few words (FirstWords) and the linked words
(BlueWords). This can be seen by noticing that bubbles 2 and 3 (FirstWords and Intro)
are almost at the same height as bubbles 12 and 13 (MainSection and AllSections).
Furthermore, we see that if we combine sections representing the first few words (FirstWords and
Intro) with sections representing only the linked words (inLinks and BlueWords), we achieve
an accuracy that is higher than using the entire Wikipedia article (AllSections). This is shown
in Figure 5.2, since bubbles 10 and 11 are higher than bubbles 12 and 13.

We compare and analyze the accuracy of each section so that sections can be compared against
each other in terms of their performance and redundancy. From Figure 5.2, ImageOrFile,
Refs and RefsExtLinkSeeAlso give significantly lower accuracy. Therefore, we eliminate
those sections from our analysis and only use 6 of the original 9 sections. The comparison is
performed by analyzing (1) which articles each section classified correctly and (2) how often
different sections agreed with one another. We want to find the pairs of sections that best
complement each other and thus could be combined to improve the overall classification. We
cannot rely blindly on measures like correlation or mutual information, since these measures
do not take into account the accuracy of the individual sections. For example, if two classifiers
were random label generators, the mutual information between the two classifiers is likely to
Table 5.3: Classification Accuracy on Individual Sections

ID    Section Name      Average Size (in Words)    KNN     NB      SVM
1     All Sections      349                        87.8    71.5    90.9
2     Main Section      320                        86.7    71.7    90.2
3     Intro             63                         83.1    70.6    90.1
4     inLinks           77                         83.7    79.6    89.4
5     First Words       44                         83.9    70.9    90.2
6     Blue Words        114                        85.1    79.1    89.4
7     Refs              68                         70.1    55.7    78.8
8     Image Captions    32                         66.8    3.1     74.7
9     outLinks          69                         74.9    61.2    82.7

be very low, but the accuracy of both classifiers will also be low. Therefore, before using these
measures, we need to ensure that each individual section being compared has reasonably high
accuracy. When comparing the sections to find the most suitable pairs, we look for
pairs that show low correlation or low mutual information while both sections still have
high accuracy. To compare two sections, e.g., A and B, or to measure the degree of redundancy
in these sections, we used the correlation coefficient and the normalized mutual information (NMI),
as described in [66], between the label predictions of each section. The NMI measure
gives a value of +1 if X and Y are perfectly correlated (either negatively or positively) and
a value of 0 if X and Y are independent. The formulations of the NMI measure and the correlation
coefficient are given in Equations 5.1 and 5.2, respectively.


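\[ \mathrm{NMI}(X, Y) = \frac{\sum_{x,y} p(x,y) \ln \frac{p(x,y)}{p(x)\,p(y)}}{-\sum_{x,y} p(x,y) \ln p(x,y)} \tag{5.1} \]

\[ \rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} \tag{5.2} \]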

Tables 5.4 and 5.5 give sample values of the NMI and the correlation coefficient, respectively,
obtained using the given formulas, where X and Y were the raw predicted labels. These
Figure 5.2: Accuracy of different sections plotted against the average size of the feature vector.
The x-axis shows the average number of words in the section and the y-axis shows the average
accuracy obtained by using only that section of the Wikipedia article for classification. These
results are obtained using the “current affairs” group of categories as the data and SVM as the
classifier.

labels are for the classes under the group “current affairs” and the SVM classifier. For example,
Table 5.4 shows that the normalized mutual information between section ID 3 (Intro) and
section ID 6 (Blue Words) is 0.499, and the correlation between these two sections, shown in
Table 5.5, is 0.833. In the rest of this chapter, to prevent redundancy, wherever the trend is
similar across groups, we will use the “current affairs” group and the SVM classifier to present
our results. Tables 5.4 and 5.5 show that the simple correlation formula exhibits a similar
pattern of independence; however, the NMI values are better spread over the range.
The pairs of sections that show a low correlation or NMI measure contain the least amount of
redundant information, and since the sections compared all had good accuracy, the pairs that
show low mutual information should increase the accuracy when combined.
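
To make the two redundancy measures concrete, the sketch below computes them for two sequences of predicted labels. It assumes that NMI is the mutual information divided by the joint entropy (our reading of Equation 5.1, consistent with the denominator that survives in the source and with the unit diagonal of Table 5.4) and that class labels are encoded as integers so that Pearson's correlation can be applied to the raw labels. The function names and the toy label sequences are hypothetical.

```python
import math
from collections import Counter

def nmi(x_labels, y_labels):
    """Mutual information of two label sequences divided by their joint entropy."""
    n = len(x_labels)
    joint = Counter(zip(x_labels, y_labels))
    count_x = Counter(x_labels)
    count_y = Counter(y_labels)
    mutual_info = 0.0
    joint_entropy = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mutual_info += p_xy * math.log(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
        joint_entropy -= p_xy * math.log(p_xy)
    # The ratio is undefined when both sequences are constant; this sketch returns 0.0 then.
    return mutual_info / joint_entropy if joint_entropy > 0 else 0.0

def correlation(x_labels, y_labels):
    """Pearson correlation coefficient of two integer-encoded label sequences."""
    n = len(x_labels)
    mean_x = sum(x_labels) / n
    mean_y = sum(y_labels) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x_labels, y_labels)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x_labels) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y_labels) / n)
    if std_x == 0 or std_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / (std_x * std_y)

# Toy predictions (hypothetical) from two section-specific classifiers on six articles.
intro_preds = [0, 0, 1, 1, 2, 2]
blue_preds = [0, 0, 1, 1, 2, 1]
print(round(nmi(intro_preds, blue_preds), 3), round(correlation(intro_preds, blue_preds), 3))
```

Note that correlation on integer-encoded labels depends on the arbitrary ordering of the classes, whereas NMI does not; this may be one reason the two measures are spread differently.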

5.4  Combined  Section  Results 

We also combine some of the sections in different ways to create “hybrid” classifiers. We
combine the classifiers simply by using both sections together to generate the feature vector. From
the table of NMI measures between the sections, we see that the pairs [Intro (3), inLinks (4)],
Table 5.4: NMI Measure Between Wikipedia Sections

Section ID    1        2        3        4        5        6
1             1.000    0.714    0.578    0.549    0.585    0.561
2             0.714    1.000    0.564    0.535    0.581    0.531
3             0.578    0.564    1.000    0.499    0.611    0.499
4             0.549    0.535    0.499    1.000    0.498    0.632
5             0.585    0.581    0.611    0.498    1.000    0.499
6             0.561    0.531    0.499    0.632    0.499    1.000

Table 5.5: Correlation Measure Between Wikipedia Sections

Section ID    1        2        3        4        5        6
1             1.000    0.938    0.875    0.862    0.882    0.860
2             0.938    1.000    0.878    0.868    0.888    0.857
3             0.875    0.878    1.000    0.840    0.898    0.833
4             0.862    0.868    0.840    1.000    0.841    0.891
5             0.882    0.888    0.898    0.841    1.000    0.835
6             0.860    0.857    0.833    0.891    0.835    1.000

[First Words (5), inLinks (4)], [Intro (3), Blue Words (6)] and [First Words (5), Blue Words (6)]
best complement each other. These pairs are consistent among the different groups of classes
mentioned in this chapter. We run the classification on these four pairs of sections. Table 5.6
shows the accuracy of these four pairs along with the results obtained
by using the entire Wikipedia article, i.e., the section “All Sections.” From Table 5.6 and
Figure 5.3, we see that combinations of sections such as First Words and Blue Words (ID 5,6)
consistently outperform the entire combined article (ID 1).
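
One simple way to read “using both sections together to generate the feature vector” is to pool the tokens of the chosen sections into a single bag of words before counting. The sketch below illustrates that reading only; the function `combine_sections`, the section keys and the toy article are hypothetical, and the text does not state whether a word occurring in both sections is counted once or twice (this sketch adds the counts).

```python
from collections import Counter

def combine_sections(article_sections, names):
    """Pool the tokens of the named sections into one raw-count bag of words."""
    combined = Counter()
    for name in names:
        # A section that is absent from the article contributes nothing.
        combined.update(article_sections.get(name, []))
    return combined

# Toy article (hypothetical), already parsed into tokenized sections.
article = {
    "FirstWords": ["bank", "interest", "rates"],
    "BlueWords": ["bank", "federal", "reserve"],
    "Refs": ["journal"],
}
hybrid = combine_sections(article, ["FirstWords", "BlueWords"])
print(hybrid)
```

The resulting counts can then be mapped onto a shared vocabulary in the same way as for a single section.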

A different way of combining the sections of a Wikipedia article is to take the
majority vote of the predictions from the different sections. In this approach, we first obtain
a classification label from each section individually. Then we assign to the document the label
that is given by the majority of the sections. For example, to classify an article
X, we first classify it using only one section at a time and obtain the classification labels.
Assuming that two of the sections labeled X as finance and four of the sections labeled X as
Table 5.6: Classification Accuracy (in percentage) for Combination of Sections

ID     Section Name      Average Size (in words)    Arts    CA      Science    Tech
1      All Sections      349                        92.2    90.9    88.8       81.9
3,4    Intro-inLinks     118                        93.6    91.6    89.6       82.4
5,4    Firsts-inLinks    103                        93.2    92.1    89.4       83.2
3,6    Intro-Blues       155                        93.2    91.8    89.6       82.6
5,6    Firsts-Blues      140                        93.3    91.8    89.7       83.4

law, then the final classification label of X will be law, as it was the label given to it by the
majority of the sections. Taking a majority vote over the sections classified individually
significantly improves the performance over using the entire article together or even the combined
section “First Words - Blue Words” alone. Figure 5.4 shows the accuracy obtained by three different
methods of combining the sections of a Wikipedia article, namely, (1) all sections together by
using the entire article, (2) only the first few words of the article combined with the linked
words (sections Firsts-Blues) and (3) individual sections used independently, with a majority vote
determining the final classification. Figure 5.4 shows that using the entire
article together at one time gives the lowest accuracy of the three methods.
However, if one uses the entire text of the article but takes the majority vote of the individual
sections, we get better accuracy than using the first words and the linked (blue) words.
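
The majority-vote combination can be sketched in a few lines. The example reuses the scenario from the text (two sections predicting finance, four predicting law); the function name `majority_vote` and the assignment of labels to particular sections are hypothetical. How ties are broken is not specified in the text; this sketch returns the tied label that was encountered first.

```python
from collections import Counter

def majority_vote(section_predictions):
    """Return the label predicted by the largest number of sections."""
    counts = Counter(section_predictions.values())
    return counts.most_common(1)[0][0]

# Predictions for one article X from six section-specific classifiers.
predictions = {
    "AllSections": "law",
    "MainSection": "law",
    "Intro": "finance",
    "inLinks": "law",
    "FirstWords": "finance",
    "BlueWords": "law",
}
final_label = majority_vote(predictions)
print(final_label)  # law
```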

After classifying Wikipedia articles using different sections, we also experimented with
classifying articles in a different domain, namely NYT, to show the effectiveness of parsing
articles into different sections for cross-domain classification. In our cross-domain
classification, only the labeled Wikipedia articles were used to classify the NYT articles. Since
the two domains have different distributions of documents in each category as well as differences
in writing style, we expect the cross-domain classification accuracy to be significantly lower
than the in-domain accuracy (where the test and training sets are from the same domain). Even
though the accuracy in the cross-domain experiments is, as expected, lower in absolute terms,
we still get an improvement in classification accuracy when using a parsed Wikipedia article
over using the entire article together. Figure 5.5 shows a clear pattern: the individual
Wikipedia sections outperform the entire article taken together, or at least do as well as the
entire article, even in the case of cross-domain classification.
Figure 5.3: Accuracy of different sections for different groups of categories. The graph
consistently shows that the section “All Sections” does not perform as well as the combination of
the first few words (section FirstWords or Intro) and links (section BlueWords or inLinks) in the
Wikipedia article. In addition to this improvement, one should also note the reduction in the
training data, since First Words and Blue Words combined consist of less than 50% of the
article.


5.5  Conclusion 

Our work on Wikipedia shows that one can get better classification accuracy by using
only the “Blue Words” and the introduction part of the article than by using the entire
Wikipedia article. This reduces the article text (or sparse feature vector length) by almost 70%.
We also show that a more effective way of using the entire article text is to first parse it into
different sections and then take a majority vote of the different classifiers created by using the
different sections individually. We demonstrate that these results are consistent among different
groups of classes. We also use Wikipedia to classify a completely unlabeled dataset from a
different domain, in this case, news stories from NYT. We see that different parts of the Wikipedia
article again outperform the entire article taken together when classifying documents in a
different domain. The initial approach of using parsed components of Wikipedia articles instead of
the entire article text yields promising results. There are many ways to extend this
research topic. We can develop different measures to combine different sections and thereby
improve the cross-domain accuracy. In this chapter, we use Normalized Mutual Information
Figure 5.4: Comparison of accuracies of the entire article, only first and blue words, and the
majority vote from different sections. We see that using the entire Wikipedia article does not give
us the best accuracy. We achieve a significant improvement in the classification accuracy when
different sections are used independently and combined by majority vote.


(NMI) and the correlation coefficient as measures to combine different sections. In the future, we
would like to experiment with different datasets from a variety of domains.
Figure 5.5: Comparison of accuracy for the cross-domain classification. Even though the differences
between the accuracies obtained from the different section combinations are small, the combination
of a few sections (solid lines) still shows an improvement over the accuracy obtained from using
the entire article at once (dotted red line).
CHAPTER 6:
Cross-Domain Classification Using Conventional Algorithms


In  the  previous  chapter,  we  discussed  how  one  can  use  Wikipedia  as  a  common  source  of  labeled 
documents  to  classify  unlabeled  documents  in  another  domain.  We  introduced  methods  that 
were  specifically  designed  to  work  with  Wikipedia  articles  as  the  source  of  labeled  training 
documents.  These  Wikipedia-specific  algorithms  parsed  a  Wikipedia  article  into  its  different 
sections, such as “Introduction,” “References,” “External Links,” etc. Although we achieved an
improvement in the accuracy of the cross-domain classification of NYT articles by using the
Wikipedia-specific algorithms, our main research goal is to develop general algorithms that
can be used to classify articles in domain Y by using labeled articles from
domain X, where X and Y can be any two domains.

In this chapter, we will first establish the need for cross-domain classification by showing that
conventional algorithms fail to classify documents accurately when the training and testing data
come from two different domains. We will show that the average drop in the accuracy of a
conventional algorithm going from single-domain to cross-domain classification ranges from
22% to 30%. In the next chapter, we will introduce algorithms that improve the accuracy
of cross-domain document classification over the conventional methods for any two
domains.

We choose a set of the most commonly used classifiers for text document classification for our
empirical studies on cross-domain document classification. These conventional classifiers are
chosen for their simplicity, their popularity in document classification research and their robust
theoretical foundation. We choose four such classifiers: (1) K-nearest neighbor with
Euclidean distance (KNN-Euclidean), (2) K-nearest neighbor with Cosine distance (KNN-Cosine),
(3) Support Vector Machines (SVMs) and (4) Naive Bayes (NB).

In the first section we show the classification results when we use the entire vocabulary from
the two domains to generate our count feature vectors, as explained in Chapter 3. In the second
section, we reduce the vocabulary by only including the words that appear in both domains and
appear at least 100 times in the entire dataset of the two domains. The motivation to reduce the
vocabulary stems from our observation of the results introduced in the first section of this
chapter and from some previous works on text document classification [67, 68].

In the conclusion of this chapter, we also obtain baseline measurements against which our
algorithms for cross-domain classification can be compared. These baseline measurements are
the best accuracy results that we obtain from the conventional classifiers. Choosing the
conventional classifiers and the feature vectors that give us the best cross-domain classification
performance is the fairest way to evaluate our algorithms.

6.1  Classification  Using  Entire  Vocabulary 

In this section we present experimental results of document classification on our cross-domain
data using conventional classifiers such as KNN, SVMs and NB. The feature vectors used in
these experiments were the raw word counts from the text after removing the stop words. The
entire vocabulary from both domains was used to obtain these counts. For the KNN
classifier, two common measures were used, namely cosine similarity and Euclidean distance.
The K chosen for the KNN classification was 31. This value was chosen after
preliminary experiments with the data, and also because it is a reasonable value for
multi-class classification when the number of classes is 5 or 6. Our algorithms show an
improvement over the best conventional algorithms for cross-domain classification. We use the
results of the conventional classifiers that provide the best cross-domain accuracy as our
benchmark for the conventional algorithms. Figure 6.1 shows the results of the conventional
classifiers on the classification of the Wikipedia and NYT data.

Figure 6.1 contains four sub-figures in two rows. The top row shows the results of classifying
NYT articles using NYT as the training set (a) and Wikipedia as the training set (b). The bottom row
shows the results of classifying Wikipedia articles using Wikipedia as the training set (c) and NYT
as the training set (d). In each sub-figure, the x-axis shows different category groups such as “Arts,”
“Computers,” etc., while the y-axis shows the accuracy in percentage. Figure 6.1 includes a legend
listing the four conventional algorithms. It can be seen, by comparing the right and left bar graphs,
that when the training and testing sets were from different domains the classification accuracy drops
significantly. This drop in the accuracy varies across classifiers as well as across groups
of categories. For example, with the SVM classifier, the average accuracy in classifying NYT articles
drops by 28% when Wikipedia rather than NYT is used as the training set.

Figure 6.1: Classifier Comparison Using the Entire Vocabulary from Two Domains


Table 6.1 shows the average drop in the accuracy when we classify across domains, that is,
when we use different domains for the test and training sets. The percentage drop in the Wiki2NYT
column is calculated relative to the accuracy in the NYT2NYT column. That
is, it is the drop in the accuracy in classifying NYT articles when using Wikipedia as the training
set (Wiki2NYT) versus when using NYT as the training set (NYT2NYT). Similarly, the percentage
drop in the NYT2Wiki column is the drop in accuracy compared with Wiki2Wiki. We see that
these classifiers show an average drop in accuracy of roughly 25%, with Naive Bayes (NB) and
KNN-Euclidean performing the worst, and KNN-Cosine and Support Vector Machines (SVMs)
performing the best.
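
The percentage drop described above can be illustrated with a short calculation. This is only a sketch: the formula (a relative drop, expressed as a percentage of the same-domain accuracy) is our reading of the description, and the function name `percent_drop` is hypothetical. The inputs are the average SVM accuracies reported in Table 6.1.

```python
def percent_drop(same_domain_acc, cross_domain_acc):
    """Relative accuracy loss, as a percentage of the same-domain accuracy."""
    return 100.0 * (same_domain_acc - cross_domain_acc) / same_domain_acc


# Average SVM accuracies (entire vocabulary) as reported in Table 6.1.
wiki2nyt_drop = percent_drop(same_domain_acc=85.7, cross_domain_acc=61.9)  # vs. NYT2NYT
nyt2wiki_drop = percent_drop(same_domain_acc=88.2, cross_domain_acc=67.4)  # vs. Wiki2Wiki

print(f"Wiki2NYT drop: {wiki2nyt_drop:.1f}%")
print(f"NYT2Wiki drop: {nyt2wiki_drop:.1f}%")
```

Computed from the averaged accuracies, these drops come out close to, but not exactly equal to, the tabulated 28.0% and 24.0%. One possible explanation, which we state only as an assumption, is that the tabulated drops are averaged per category group rather than computed from the averaged accuracies.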

                            Accuracy                  % Drop in Accuracy
Classifier   Wiki2Wiki  NYT2Wiki   NYT2NYT  Wiki2NYT    NYT2Wiki  Wiki2NYT
knncosine         84.7      65.7      78.8      63.3        22.8      19.9
knneuc            66.4      45.2      63.7      46.7        32.1      26.6
naivebayes        69.4      52.4      68.6      33.4        24.1      50.8
svm               88.2      67.4      85.7      61.9        24.0      28.0

Table 6.1: The average accuracy, over all the category groups, for cross-domain classification as
obtained from different classifiers, and the corresponding percentage accuracy drop. The percentage drop in the
Wiki2NYT column is the drop in accuracy compared with NYT2NYT; similarly, the percentage drop in NYT2Wiki is
compared with Wiki2Wiki.


As observed from Figure 6.1 and Table 6.1, the NB and KNN-Euclidean classifiers
perform the worst and also show larger drops in accuracy. This discrepancy in
classifier performance hints at the reason for the drop. Both NB and KNN-Euclidean
are heavily influenced by words that appear only in the training set or only in the testing set.
For example, if the NB classifier is given words in the testing data that it never observed in the
training data, this results in a division by zero (which is normally handled by some form of
artificial smoothing). In our case, we perform this smoothing by assigning a probability of
$10^{-5}$ to words that are not seen in the training data but are seen in the testing data. The
Euclidean distance between two vectors $v_1$ and $v_2$ is also heavily influenced by words that
appear in only one of the two vectors.
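
The effect of the fixed floor probability for unseen words can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the function names, the toy documents and the uniform class priors are all hypothetical, and only the floor value of 1e-5 is taken from the text.

```python
import math
from collections import Counter

UNSEEN_PROB = 1e-5  # floor for words absent from the training data (value from the text)


def train_word_probs(docs):
    """Estimate P(word | class) from the raw counts of tokenized training documents."""
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def log_likelihood(doc, word_probs):
    """Sum of log P(word | class); unseen words receive the fixed floor probability."""
    return sum(math.log(word_probs.get(word, UNSEEN_PROB)) for word in doc)


def classify(doc, class_word_probs, class_priors):
    """Return the class with the highest log prior plus log likelihood."""
    scores = {
        label: math.log(class_priors[label]) + log_likelihood(doc, probs)
        for label, probs in class_word_probs.items()
    }
    return max(scores, key=scores.get)


# Toy source-domain training data (hypothetical).
class_docs = {
    "arts": [["film", "music", "film"]],
    "computers": [["software", "code", "code"]],
}
class_word_probs = {label: train_word_probs(docs) for label, docs in class_docs.items()}
class_priors = {"arts": 0.5, "computers": 0.5}

# A target-domain document in which two of the three words were never seen in training.
test_doc = ["film", "blog", "post"]
predicted = classify(test_doc, class_word_probs, class_priors)
print(predicted)
```

Each unseen word contributes log(1e-5) to every class score, so when most words of a target-domain document are unseen, the few shared words decide the outcome. This is one way to see why NB is sensitive to vocabulary mismatch between domains.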

Considering this pattern in the classifiers and surveying other literature [67, 68, 69, 70], we
list some of the factors that may be responsible for this drop in the cross-domain classification
accuracy:

1.  Different  size  of  the  feature  vectors  and  different  length  of  the  articles; 

2.  Different  proportions  of  the  classes  in  each  domain  [69]; 

3.  Different  vocabulary.  Many  unique  words  that  only  appear  in  one  of  the  domains; 

4. Different distance measures. The points in different domains may be clustered or
distributed with different distributions, thus having different underlying distance measures
between them;

5. Different “meaning” of the same topic or words. This is the most abstract but probably
one of the most important reasons that explains the inability of the conventional
classification methods to generalize across domains.

This list also gives us a summary of the issues that need to be addressed in the course of this
research. Items (1) and (3) from the list above are supported by the poor performance of NB
and KNN-Euclidean and the relatively better performance of SVMs and KNN-Cosine. Since
KNN-Cosine and SVMs use the inner product of the feature vectors, and KNN-Cosine is also resistant
to different sizes of the feature vectors, the loss in accuracy is relatively smaller in the case of
KNN-Cosine and SVMs. Item (1) in the list is also encountered in single-domain classification,
as two documents can vary greatly in size within a single domain. We resolve this issue by
simply normalizing the feature vectors.
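
The sensitivity of Euclidean distance to document length, and the effect of normalization, can be sketched with toy count vectors. The text does not specify which normalization is used, so the L2 normalization below is an assumption, and all names and vectors are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a count vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy word-count vectors over a four-word vocabulary (hypothetical).
short_doc = [2, 1, 0, 0]
long_doc = [20, 10, 0, 0]  # same word proportions, ten times longer
other_doc = [0, 0, 2, 1]   # same length as short_doc, disjoint words

raw_same_topic = euclidean(short_doc, long_doc)
raw_other_topic = euclidean(short_doc, other_doc)
normalized_same_topic = euclidean(l2_normalize(short_doc), l2_normalize(long_doc))

print(raw_same_topic, raw_other_topic)
print(cosine_distance(short_doc, long_doc), normalized_same_topic)
```

On raw counts, the short document is farther from the long document with identical word proportions than from a document that shares none of its words. Cosine distance ignores length, and after normalization the Euclidean distance between the two same-topic documents becomes negligible as well.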


6.2  Classification  Using  Intersected  Vocabulary  Obtained  by 
Removing  Rare  and  Unique  Words 

In this section we present cross-domain classification results after removing from the vocabulary
the words that appear fewer than 100 times in the two domains (rare words) and the words that occur
in only one of the two domains (unique words). This intersection of the two vocabulary
spaces, belonging to two different domains, addresses item (3) in the list of reasons
introduced in the previous section. We notice a relatively low performance of KNN-Euclidean and
NB even when the same domain is used for training and testing, e.g., Wiki2Wiki or NYT2NYT.
This can be caused by the presence of rare words that often do not contribute to the classification
accuracy [67, 68]. Even though item (3) discusses only the presence of the “unique” words,
to get a more meaningful intersection of the vocabulary spaces we also remove the rare words, as
they have a similar effect to the unique words in cross-domain classification. The intersected
vocabulary therefore contains only the words that occur in both domains and occur at least 100
times across the two domains.
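
The construction of the intersected vocabulary can be sketched as follows. This is a minimal illustration assuming tokenized documents and a frequency threshold applied to the combined count over both domains; the function name and the toy data are hypothetical.

```python
from collections import Counter


def intersected_vocabulary(domain_a_docs, domain_b_docs, min_count=100):
    """Keep words that occur in both domains and at least min_count times overall."""
    counts_a = Counter(word for doc in domain_a_docs for word in doc)
    counts_b = Counter(word for doc in domain_b_docs for word in doc)
    shared = counts_a.keys() & counts_b.keys()  # drops the "unique" words
    return {word for word in shared if counts_a[word] + counts_b[word] >= min_count}


# Toy tokenized documents; a small threshold is used so the example is visible.
domain_a = [["film", "music", "film"], ["opera", "film"]]
domain_b = [["film", "review"], ["music", "film", "blog"]]

vocab = intersected_vocabulary(domain_a, domain_b, min_count=3)
print(sorted(vocab))
```

In this toy example "opera", "review" and "blog" are removed as unique words, and "music" is removed as a rare word because it occurs only twice across both domains.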

Figure 6.2 shows the classification accuracy, and hence the cross-domain drop, when the vocabulary
has been reduced by over 90% by removing the rare and unique words. Figure 6.2 contains four
sub-figures in two rows, laid out in the same manner as Figure 6.1. Table 6.2 shows the average drop
in the accuracy with the reduced vocabulary; the average is taken across all category groups of the
domain pair Wiki-NYT.

As expected, reducing the vocabulary in this way does not lead to any significant reduction in the
performance of the classifiers. Compared with the entire vocabulary, we see an average drop of less
than 1% in most cases, and in some cases we even see an increase in the accuracy.

Figure 6.2: Classifier Comparison After Intersecting the Vocabulary Spaces


One somewhat surprising result is that removing the words that occur in only one of the
domains even increases the accuracy of single-domain classification. This is an interesting
and useful result, as it gives an efficient way to reduce the dimension of a text collection by
removing the words that do not occur in other similar domains. For example, these results
show that we do not lose any accuracy in New York Times article classification by not using
the words that never appear in Wikipedia. Even though Wikipedia and NYT are independent
and distinct domains, the most useful words for classification are the words that occur in both
domains. This process brought down the average dimension of the feature vectors from 90,000
to 9,000. This reduction in the size of the feature vectors makes it possible to run all the
experiments in a realistic amount of time without losing any accuracy in the process. Unless otherwise
specified, we do not use any other dimension reduction methods, such as principal component
analysis (PCA), in our experiments. For all the experiments in this dissertation we use
the reduced vocabulary, obtained by removing the rare words and the unique words. Table 6.3
and Table 6.4 show the average drops in the accuracy for the other two domain pairs, namely
Wikipedia-Newsgroups and Newsgroups-NYT, respectively.

                            Accuracy                  % Drop in Accuracy
Classifier   Wiki2Wiki  NYT2Wiki   NYT2NYT  Wiki2NYT    NYT2Wiki  Wiki2NYT
knncosine         84.7      65.8      79.0      62.0        22.6      21.7
knneuc            68.6      46.4      65.0      46.0        32.6      29.1
naivebayes        67.5      52.4      68.7      33.2        22.2      51.3
svm               87.3      66.7      85.4      60.1        24.1      29.7

Table 6.2: The average drop in the accuracy in percentage across different category groups with the intersected
vocabulary space between two domains. The vocabulary has been reduced by 90% by removing the rare words
and the words that were unique to one of the two domains. Compared with Table 6.1 it shows almost no difference
in the accuracies.

                            Accuracy                  % Drop in Accuracy
Classifier   News2News Wiki2News Wiki2Wiki News2Wiki   Wiki2News News2Wiki
knncosine         80.0      61.8      91.8      73.1        22.5      20.5
knneuc            65.4      27.5      75.1      57.7        56.7      21.6
naivebayes        69.4      34.0      78.5      56.8        50.2      27.2
svm               85.8      63.0      94.0      80.0        26.1      14.9

Table 6.3: The average drop in the accuracy in percentage across different category groups with the intersected
vocabulary space between two domains, Newsgroups-Wikipedia.

Table 6.3 and Table 6.4 show a similar trend: a reduction in the accuracy of
approximately 25%, with KNN-Cosine and SVM being the best classifiers. Graphs similar to
Figure 6.1, showing the drop in cross-domain classification accuracy with conventional methods for the
Wikipedia-Newsgroups pair and the NYT-Newsgroups pair, are included in Appendix A.

                            Accuracy                  % Drop in Accuracy
Classifier     NYT2NYT  News2NYT News2News  NYT2News    News2NYT  NYT2News
knncosine         88.5      71.3      85.8      65.9        17.2      23.3
knneuc            78.4      59.7      56.6      35.3        18.7      37.8
naivebayes        75.2      63.7      77.3      50.7        11.5      33.7
svm               92.6      74.8      94.5      63.3        17.9      33.1

Table 6.4: The average drop in the accuracy in percentage across different category groups with the intersected
vocabulary space between two domains, Newsgroups-NYT.

6.3 Conclusion

This chapter shows the drop in accuracy of four conventional classifiers when two different
domains are used as testing and training sets. We also show that removing words that occurred
fewer than 100 times in the two domains does not decrease the accuracy of the classification but
reduces the vocabulary size by over 90%. This result is summarized in Figure 6.3, which
shows the results of cross-domain and single-domain classification using the entire vocabulary and
the reduced vocabulary. The results show no change in the accuracy after removing the rare
words, which reduces the vocabulary by over 90%. The figure also shows a consistent drop in the
accuracy of approximately 25% when two different domains are used for testing and training
sets.

By intersecting the vocabulary of the two domains and removing the rare words, we see
almost no drop (less than 1% on average) in the accuracy, and in some cases we even see an
increase in the accuracy. This increase in the accuracy is not surprising, as rare words can often
make the data noisier without contributing to the classification patterns, thus reducing the
overall accuracy [67, 68]. On the other hand, we do obtain an interesting result by observing
that the classification accuracy increases for single-domain classification after removing the
words that do not occur in both domains. This gives us another way to reduce the dimensions
of text data for classification, i.e., by not including the words that do not occur in another similar
domain.

Given the results in Figure 6.3, we use the reduced vocabulary for all our experiments. We
thus obtain a baseline measurement for the drop in accuracy by using the intersected
vocabulary and primarily comparing our results with the KNN-Cosine and SVM classifiers. This baseline
measurement of the drop in accuracy for the NYT-Wiki domain pair is shown in Figure 6.4.

This chapter introduces and then elaborates on the problem of cross-domain document
classification. We obtain empirical results showing a significant drop in accuracy for cross-domain
classification when using the conventional classification algorithms. We also summarize and list
some of the reasons for this drop. Based on these reasons we show that removing the rare and
unique words does not affect the accuracy and in some cases even increases the accuracy by
a small amount. In the next chapter, we will introduce more of our algorithms based on topic
models to deal with the problem of cross-domain classification.

(a) NYT2NYT & Wiki2NYT - Entire Vocabulary    (b) Wiki2Wiki & NYT2Wiki - Entire Vocabulary
(c) NYT2NYT & Wiki2NYT - Reduced Vocabulary   (d) Wiki2Wiki & NYT2Wiki - Reduced Vocabulary

Figure 6.3: The figure summarizes the results of cross-domain (yellow bar) and single domain (green bar)
classification using entire vocabulary (top row) and the reduced vocabulary (bottom row). The results show no change in
the accuracy after removing the rare words and a consistent drop of approximately 25% when two different domains
are used for testing and training sets.

(a) NYT2Wiki Accuracy Drop    (b) Wiki2NYT Accuracy Drop

Figure 6.4: The figure shows the percentage accuracy drop in cross-domain classification from conventional
classifiers for the domain pair NYT-Wiki. The NYT2Wiki drop is compared against Wiki2Wiki and Wiki2NYT drop is
compared against NYT2NYT.

CHAPTER 7:

Novel  Cross-Domain  Classification  Algorithms 


In this chapter we explain our cross-domain classification algorithms based on topic models
such as Latent Dirichlet Allocation (LDA). We develop these algorithms and formulations based
on theoretical properties of various distributions as well as on our empirical evaluations of the
topic models. We present these different algorithms and formulations for cross-domain
document classification and motivate each one with theoretical and empirical observations. The
algorithms shown in this dissertation are based on LDA; however, they can be
extended to any other topic model with few or no modifications.

We first motivate our choice of topic models as the basis of our cross-domain classification
algorithms. Topic models, as introduced in Chapter 3, represent a generative model for a collection
of documents. This generative model often assumes that the collection of documents covers a
number of different topics, where each document in the collection may contain words belonging
to one or more of these topics. A topic, in a technical sense, is a distribution over the words
in the vocabulary; however, a topic can be assumed to have some semantic meaning by itself.
For example, a topic that assigns higher probability to words such as “genes,” “diversity,”
“species” may point to a semantic topic of “theory of evolution.”

Topic  models  give  us  a  level  of  abstraction  (or  generalization)  by  extracting  common  topics 
from  all  the  documents.  Any  document  can  then  be  represented  in  terms  of  these  topics  instead 
of  specific  words  in  the  document.  The  documents  on  similar  subjects  in  two  different  domains 
must  share  some  common  topics  between  them.  A  topic  model  such  as  LDA  is  a  tool  to  extract 
these  topics  from  the  documents  that  are  independent  of  the  domain  that  the  documents  belong 
to.  We  expect  that  it  is  this  generalization  of  documents  that  makes  the  transfer  of  learned 
information  from  one  domain  to  another  domain  possible. 

In this chapter, we first analyze the different components of LDA that can help us use the topic-level
abstraction in classifying documents. We explain the process of obtaining the generalizable
information content from the source domain that is then used for the cross-domain document
classification. Then we go over two different ways of using LDA for classification: (1) using
the vectors describing the distribution of topics in documents ($\gamma$ vectors) and (2) using the
topics as distributions over words ($\beta$ vectors). We describe the methods and distance metrics
for our methods and give empirical results.


7.1  Abstraction  of  the  Source  Domain  Using  Latent  Dirichlet 
Allocation  (LDA) 

As previously discussed in more detail, given a set of documents where each document is a
collection of words, the Latent Dirichlet Allocation (LDA) model distributes the documents over $k$
different topics and represents each document in terms of proportions of these topics. The LDA model
is defined by two parameters, namely $\alpha$ and $\beta$: $\alpha$ is the parameter of the Dirichlet
distribution that is used to generate the topic distributions for individual documents, and $\beta$ is a
set of multinomial distributions over the words, one for each topic. Given the model parameters
$\{\beta, \alpha\}$, to generate a document with $N$ words $\mathbf{w} = \{w_1, w_2, \ldots, w_N\}$ under
LDA, we first obtain a topic distribution $\theta \sim \mathrm{Dirichlet}(\alpha)$; we then generate each
word $w_n \in \{w_1, \ldots, w_N\}$ by first sampling a topic $z_n \sim \mathrm{Multi}(\theta)$ and then
sampling $w_n \sim \mathrm{Multi}(\beta_{z_n})$, where Multi represents the multinomial distribution.
Each topic $\beta_{z_n}$ is a multinomial distribution over $V$ words, where $V$ is the number of words
in the entire vocabulary.
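
The generative process described above can be sketched in a few lines of Python. This is an illustration only, using the standard library: the function names, the toy vocabulary and the parameter values are hypothetical, and the Dirichlet draw is obtained by normalizing independent Gamma draws.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, vocab, n_words, rng):
    """Generate one document: draw theta, then for each word draw a topic z_n and a word w_n."""
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(beta)))
    words = []
    for _ in range(n_words):
        z_n = rng.choices(topic_ids, weights=theta)[0]  # z_n ~ Multi(theta)
        w_n = rng.choices(vocab, weights=beta[z_n])[0]  # w_n ~ Multi(beta_{z_n})
        words.append(w_n)
    return theta, words


# Toy model with k = 2 topics over a vocabulary of V = 4 words (hypothetical values).
vocab = ["genes", "species", "banks", "lending"]
beta = [
    [0.45, 0.45, 0.05, 0.05],  # a topic concentrated on biology words
    [0.05, 0.05, 0.45, 0.45],  # a topic concentrated on finance words
]
alpha = [0.5, 0.5]

rng = random.Random(0)
theta, words = generate_document(alpha, beta, vocab, n_words=20, rng=rng)
print(theta)
print(words)
```

Small values of alpha tend to produce documents dominated by one topic, while larger values spread each document more evenly over the topics.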


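Equation 7.1 gives the expression for the probability of $\mathbf{w}$ using the notation of Blei et al. [11],
where $\mathbf{w}$ is a word vector, given the model parameters $\alpha$ and $\beta$.

$$
p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)}
\int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right)
\left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta
\qquad (7.1)
$$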


Before  we  present  the  use  of  LDA  for  cross-domain  classification,  we  will  first  go  over  our 
process  of  obtaining  distributions  (generalizable  information  content)  from  the  source  domain 
and  our  terminology  for  different  components  of  our  classification  algorithms. 


7.1.1  Abstraction  of  the  Source  Domain 

In order to transfer the learning from one domain to another, we first need to extract the
generalizable information content of the categories in the source domain. This generalizable information
content can also be treated as semantic information content defining the category. For example,
if we want to classify documents about finance in Wikipedia using the set of finance
documents in NYT, we need to extract some semantic structure of finance documents. This
semantic structure of the finance category may be a list of topics that occur commonly under finance.
These subtopics of finance can be terms like “lending,” “banks,” “investment,” etc. We choose
the LDA topic model as our underlying algorithm to extract this generalizable representation of the
documents in the source domain. LDA extracts $k$ multinomial distributions that best describe
the set of documents presented to it. These $k$ distributions are called topics and are represented
by the variable $\beta$ in our framework. In LDA, each document can have words from multiple topics,
where different words are drawn from different topics. Each document therefore contains different
proportions of these $k$ topics. These topic proportions, or topic distributions, in a document
are represented by the variable $\gamma$ in our framework. $\gamma$ differs from $\theta$ (introduced
earlier) in that $\theta$ represents a draw from a Dirichlet distribution parameterized by $\alpha$, where
$\alpha$ is the learned parameter of an LDA model of the entire document set, whereas $\gamma$ is a value
of the Dirichlet parameter inferred from the model for a particular document. The $\gamma$ vectors bring
a major shift in our representation of documents, from being in terms of words to being in terms of
topics. The topics obtained from LDA ($\beta$ vectors) give us an independent semantic structure and
abstract information content of a category.

Figure 7.1 illustrates the different parts of our method to extract the $\beta$ and $\gamma$ vectors from
the source domain. Each shaded rectangular area represents a matrix. On the left of the figure,
we start with labeled documents from the source domain. The figure uses the labels from the
Arts category group as an example. There are a total of $c$ different categories. Documents from
each category are modeled independently with LDA, thus we obtain $k$ topics from each
category. This gives us a total of $c \times k$ topics ($c$ sets of $k$ topics). Each individual set of $k$
topics is denoted as $\beta_c$, where $c$ is the category label, and all $k \times c$ topics are labeled as
$\beta_{source}$, representing the topics obtained from the training set. We then assign topic proportions
over these topics to the documents in both the target and source domains using LDA inference.
This is shown in the last two matrices, where the left matrix holds the topic proportions obtained
for the source domain documents ($\gamma_{source}$) and the right matrix holds the topic proportions for the
documents in the target domain ($\gamma_{target}$). Notice that both sets of topic proportions, for training and
testing set documents, are obtained using only the topics obtained from the source domain. The
target and source domains may not have the same number of documents.

7.1.2 Obtaining Topics from the Source Domain, $\beta$ Vectors

In order to understand the general nature of a category, we wish to learn the distribution of each
category in the source domain. This distribution is obtained using the LDA model that is run
independently on each category in the source domain. Let $L_{source} = \{1, 2, 3, \ldots, c\}$ be a set of
$c$ category labels in the source (labeled) domain. We learn $k$ topics from the documents of
each category independently, thus obtaining $c$ sets of $k$ topics,
$\beta_{source} = \{\beta_1, \beta_2, \beta_3, \ldots, \beta_c\}$,
where each $\beta_i$ is a set of $k$ multinomial distributions (topics) over the words in the vocabulary.

Figure 7.1: Different parts derived from the LDA for our cross-domain document classification framework. We obtain
a set of $k$ topics for each of the categories in the source domain separately and then obtain the topic distribution of
the documents in both target and source domain over all $k \times c$ topics, where $c$ is the number of categories in our
source domain.


Figure 7.1 also shows how a set of $k$ topics is obtained from the documents of each category
in the source domain. The illustration shows an example of the category group “Art,” where
the $k$ topics are obtained from each of its six subcategories, namely theater, music, opera, film,
television and literature. Given a set of $\beta$'s, the LDA model can also infer the posterior topic
distribution in a document. This distribution of topics in a document is represented by the $\gamma$ vector
(also shown in the figure). The $\gamma$ vector is the Dirichlet parameter of the posterior distribution of
topics in a given document.

7.1.3 Obtaining Topic Distributions in a Document, $\gamma$ Vectors

Once an LDA model has been learned from a set of documents, a $\beta$ and an $\alpha$ are obtained that
model the set of documents. This model can then be used to infer the topic distribution of any
document and obtain a posterior distribution of topics in that document. The $\gamma$ vectors are the
Dirichlet parameters of these posterior distributions. A vector $\gamma_i$ can then also
be seen as a topic-based representation of a document $d_i$. Blei et al. use this topic distribution
representation of the documents to classify the documents. We discuss the use of $\gamma$ vectors
for classification further in the next section.

7.2  Topic  Distribution  (7’s)  for  Cross-Domain  Document  Clas¬ 
sification 

In this section, we discuss one way of using topic models for classification: comparing the
documents in the training and testing sets using the topic distributions of the documents.
The topic distributions of documents are represented by the γ vectors. A γ vector for a document
is a c × k dimensional vector, where c is the number of categories in the training set and k is
the number of topics (β) obtained for each category in the training set individually using the
LDA model. We concatenate the k topics obtained from each of the categories in the training set
and then run LDA inference on each document from the training set and the testing set
using this concatenated set of topics. In this way, we generate the topic proportions of the
documents in the source domain and the target domain over the topics obtained from the source
domain categories. Figure 7.2 shows this model using the Bayes network for the LDA model.

In Figure 7.2, $\beta_i$ represents the set of k topics and $\alpha_i$ represents the Dirichlet
parameter as learned by the LDA model from the ith category in the training set. The combined
value of β ($\beta_{train}$) is generated by concatenating all the individual β’s, and a common α
($\alpha_{train}$) is generated by taking the average of the individual α’s.
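
The combination step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the function name `combine_category_models` is hypothetical, each topic is assumed to be stored as a list of word probabilities, and each per-category α is assumed to be a single scalar (a symmetric Dirichlet parameter).

```python
def combine_category_models(betas, alphas):
    """Concatenate per-category topics and average the Dirichlet parameters.

    betas  : list of c topic sets; each topic set is a list of k topics,
             and each topic is a list of V word probabilities.
    alphas : list of c scalar Dirichlet parameters (assumed symmetric).
    """
    if not betas or len(betas) != len(alphas):
        raise ValueError("need one alpha for every non-empty topic set")
    # Concatenation keeps the category order: topics 0..k-1 come from
    # category 0, topics k..2k-1 from category 1, and so on.
    beta_train = [topic for topic_set in betas for topic in topic_set]
    alpha_train = sum(alphas) / len(alphas)
    return beta_train, alpha_train


# Toy example: c = 2 categories, k = 2 topics each, vocabulary of V = 3 words.
betas = [
    [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],  # topics learned from category 0
    [[0.1, 0.2, 0.7], [0.2, 0.2, 0.6]],  # topics learned from category 1
]
alphas = [0.1, 0.3]
beta_train, alpha_train = combine_category_models(betas, alphas)
```

Because the concatenation preserves category order, the category of the jth combined topic can later be recovered as `j // k`.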

Figure 7.2: A Bayes network showing an LDA model as used to obtain posterior topic distribution (γ) vectors from
the target and source domains. Each γ vector is a Dirichlet parameter for the topic distribution over the topics in
the set $\beta_S = \{\beta_1, \beta_2, \beta_3, \ldots, \beta_c\}$, where each $\beta_i$ contains k topics. Topics (β)
are combined by concatenating all the topics obtained from the training set categories, and the α parameter is
combined by taking the average of the individual α’s as obtained by LDA from each training set category.

Once we obtain the c × k topic proportion vectors, one can explore different ways to classify
the documents represented as these topic proportion vectors. Blei et al. use an SVM classifier
in their paper to classify documents from their topic proportion vectors. We experiment with
a few different classifiers, including KNN with different distance measures, SVM, and a classifier
based on word assignment for each category. For the KNN classifier, we use four different
distance measures, each of which is a reasonable choice for comparing topic proportion vectors.
In addition to two distance measures commonly used in machine learning, namely the Euclidean
and cosine distance measures, we use the KL distance to compare the distributions that are
expressed by the γ vectors. A γ vector gives the proportion of the topics in a specific document,
and one can think of it as a multinomial distribution over the topics. Therefore, we choose our
third distance to be the KL distance. However, γ vectors under LDA represent the variational
approximation of the Dirichlet parameter α. The KL distance between two d-dimensional Dirichlet
distributions as a function of their parameters (s and t) is shown in Equation 7.2 [71]. In
Equation 7.2, Γ is the gamma function and ψ is the digamma function. Although γ vectors can
sometimes be treated as multinomial distributions, especially when the sum of their elements is
large (reducing the variance of the resulting Dirichlet distribution), as our fourth measure we
use a KL distance expression derived specifically for Dirichlet distributions as a function of
their parameters.
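
For reference, the KL distance between two d-dimensional Dirichlet distributions has a well-known closed form. The expression below is that standard form, written with the parameters s and t and the functions Γ and ψ named in the text; it is assumed, not confirmed, to be what Equation 7.2 expresses.

$$
KL\big(\mathrm{Dir}(s) \,\|\, \mathrm{Dir}(t)\big) = \log\Gamma(s_0) - \sum_{i=1}^{d} \log\Gamma(s_i) - \log\Gamma(t_0) + \sum_{i=1}^{d} \log\Gamma(t_i) + \sum_{i=1}^{d} (s_i - t_i)\big(\psi(s_i) - \psi(s_0)\big),
\qquad s_0 = \sum_{i=1}^{d} s_i, \quad t_0 = \sum_{i=1}^{d} t_i
$$

The expression is zero when s = t and, like every KL distance, is not symmetric in its two arguments.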


We use Equation 7.2 to compute the distance between two γ vectors as our fourth choice
of distance measure. The KL distance for the Dirichlet distribution has recently been used in
a few other classification tasks by Hoffman et al. and Blei et al. [72, 73]. Table 7.1 summarizes
the four distance measures that we use and the reason for using each of them in the
classification of documents by their topic proportion (γ) vectors.
Distance Measure         Reason
-----------------------  ----------------------------------------------------------------
Euclidean                Most intuitive and widely used distance measure.
Cosine                   A distance commonly used in text document classification.
Kullback-Leibler (KL)    A distance between two multinomial distributions.
Dirichlet KL             KL distance derived for Dirichlet distributions (Equation 7.2).

Table 7.1: List of the four distance measures used to compare the γ vectors, and the reason for each choice.
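
The four measures in Table 7.1 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used for the experiments. The function names are hypothetical, the `digamma` helper is a standard recurrence-plus-series approximation standing in for a library routine, and `kl_dirichlet` uses the standard closed form for two Dirichlet distributions, which is assumed to correspond to Equation 7.2.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (upward recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:  # shift x upward so that the series below is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (
        1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


def kl_multinomial(p, q):
    """KL(p || q) after normalizing both vectors to sum to 1.

    Requires q to be positive wherever p is positive.
    """
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq))
               for a, b in zip(p, q) if a > 0)


def kl_dirichlet(s, t):
    """KL(Dir(s) || Dir(t)) for strictly positive parameter vectors s and t."""
    s0, t0 = sum(s), sum(t)
    value = math.lgamma(s0) - math.lgamma(t0)
    for si, ti in zip(s, t):
        value += math.lgamma(ti) - math.lgamma(si)
        value += (si - ti) * (digamma(si) - digamma(s0))
    return value
```

The first three functions treat a γ vector as a point or as a normalized distribution, while `kl_dirichlet` treats it as a Dirichlet parameter, which is the distinction drawn in the table.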


We present our empirical results for the KNN classifier with these four distance measures
in Figure 7.3. All four distance measures vary in their performance depending on the dataset
and category group used. However, the KL distance for multinomial distributions performs
better than the others in most cases. The KL distance for Dirichlet distributions performs
similarly to the cosine and Euclidean measures. Some of the variation in the KL distances may
be due to the fact that these distance measures are very sensitive to small values: an element
that is zero or close to zero leads to a logarithm of zero or to a very large log ratio. Some
of these problems may be fixed by smoothing the data with the methods that are often used to
smooth the data in Naive Bayes classification.
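
A minimal sketch of such smoothing is shown below. It uses additive (Laplace-style) smoothing, which is one common choice in Naive Bayes classification; the text does not say which method would be used, and the pseudo-count value is a hypothetical placeholder.

```python
import math


def smooth(vector, pseudo_count=1e-3):
    """Add a small pseudo-count to every element and renormalize to sum to 1."""
    total = sum(vector) + pseudo_count * len(vector)
    return [(v + pseudo_count) / total for v in vector]


def kl(p, q):
    """KL(p || q) for two probability vectors; q must be positive where p is."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


# Without smoothing, the zero in the second vector would make KL infinite.
p = smooth([0.5, 0.5, 0.0])
q = smooth([0.9, 0.0, 0.1])
distance = kl(p, q)
```

After smoothing, every element is strictly positive, so the log ratio is always defined; the price is a small bias that grows with the pseudo-count.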

Another important thing to note in these results is that we do not get a higher cross-domain
classification accuracy by using only the topic proportion vectors of the documents. This result
is not surprising, as the topic proportion vectors have fewer than 1% of the dimensions of the
original word count vectors. The fact that the performance is almost as good as that of a cosine
distance measure applied to word count vectors over the entire vocabulary is still an encouraging
result.

We also classify the topic proportion vectors using SVMs. The results of the SVM classifier for
cross-domain document classification are shown in Figure 7.4. The topic proportion vectors
obtained from the PLSA or LDA model have a lot of flexibility in terms of topic association, as
each document can belong to various topics. It is this variation that yields documents that are
not neatly separated into classes, even though the underlying topics are obtained from different
categories. We seek to formulate a distance measure that reduces this variation, especially
variation among the topics from the same category. In a topic proportion vector, the most
important element for classification is the weight, or the number of words, assigned to the topics
from different categories. How the weight is distributed among the topics from the same category
is not as important. We thus create a vector made of the percentage of words assigned to each
topic set, where each topic set is the set of k topics obtained from a single training category.
This, in turn, reduces the dimensions of the feature vector even further, as all the elements
belonging to the topics of the same category are added together. The document is classified into
the category to which most of its words were assigned. For example, for a topic proportion vector
of 120 dimensions, where 20 topics were obtained from each of 6 different categories, this
classifier adds up each pack of 20 elements and assigns the document to the category that yields
the highest total.
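
The category-weight classifier described above can be sketched as follows. The function names are hypothetical, and the sketch assumes that the topic proportion vector is ordered so that the k topics of each category are contiguous, as they are after concatenation.

```python
def category_weights(topic_proportions, num_categories, topics_per_category):
    """Sum each pack of k topic weights and normalize the c totals to sum to 1."""
    k = topics_per_category
    if len(topic_proportions) != num_categories * k:
        raise ValueError("vector length must equal num_categories * topics_per_category")
    total = sum(topic_proportions)
    return [sum(topic_proportions[c * k:(c + 1) * k]) / total
            for c in range(num_categories)]


def classify_by_category_weight(topic_proportions, num_categories, topics_per_category):
    """Return the index of the category whose topics received the most weight."""
    weights = category_weights(topic_proportions, num_categories, topics_per_category)
    return max(range(num_categories), key=weights.__getitem__)


# The example from the text: 6 categories with 20 topics each (120 dimensions).
# Most of the weight is placed on the topics of category 3 (elements 60..79).
vector = [0.1] * 120
for index in range(60, 80):
    vector[index] = 2.0
weights = category_weights(vector, num_categories=6, topics_per_category=20)
predicted = classify_by_category_weight(vector, num_categories=6, topics_per_category=20)
```

How the weight is spread inside a pack does not affect the prediction, which is exactly the variation this classifier is meant to ignore.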

Figure 7.5 shows the empirical results of using the word assignment weights of the PLSA and
LDA topic proportion vectors (PLSA Category Weight, LDA Category Weight). The results are
compared against the KL-distance KNN classifier on the γ vectors and the cosine-distance KNN
classifier on the word count vectors. In Figure 7.5, we observe that in most cases the category
weights obtained from the topic proportions outperform both the KL-distance classifier on the
γ vectors and the cosine-distance classifier on the word counts. However, the improvement is
neither significant in magnitude nor consistent across different datasets and category groups.
We point out three possible reasons for the lack of significant improvement over the
conventional methods when we use only the topic proportion vectors:

1. Enormous reduction (more than 98%) in the dimensions of the document. This reduction,
although beneficial in some cases, also causes consolidation or loss of important
information about the document category.

2. Flexibility in the LDA and PLSA models to choose the topic proportions that best fit the
document. This flexibility enables the model to produce topic proportion vectors that can
vary greatly in their topic assignments, especially when the underlying topics obtained
from different categories also show some overlap. We come back to this point in the next
section.

3. Comparisons are still being made among documents from different domains. Ideally,
we should be able to learn some general information regarding the category and compare
test documents from a new domain to this general learned model of a category.

Our  experiments  with  the  topic  proportion  give  us  insights  into  the  nature  of  topic  distributions. 
In  the  next  section,  we  explore  the  nature  of  the  topics  and  their  distribution  by  using  clustering 
algorithms  such  as  k-means  and  hierarchical  clustering.  We  then  develop  algorithms  based  on 
using  the  topics  themselves  to  improve  the  cross-domain  document  classification  accuracy. 
7.3 Clustering of Topics Obtained From the Source Domain


This section describes our analysis of the distribution of the topics obtained from different
categories using clustering methods. This analysis, although important in its own right, is a
small digression from the main point of the chapter, which is the use of the LDA model for
cross-domain classification. Readers may choose to skip this section in the interest of
continuity, as it explains a diagnostic test on the distribution of topic vectors and provides
motivation for the classifier formulations derived later in the chapter.

As discussed earlier, we obtain k topics from each category in the source domain independently.
What we hope to extract are the subtopics of each category, so the k topics from each category
will place different emphasis on the words that may point to a subtopic. There may be some
overlap, as both categories “music” and “opera” may have “performance” or “singing” as a
subtopic, but in general we expect to see a set of unique subtopics for each category. In
other words, we expect the k topics obtained from each category to belong to their own cluster.
We investigate this by clustering the topics using k-means and hierarchical clustering. When we
use k-means, we cluster all the topics into c different clusters, where c is the number of
categories.


Figure 7.6 shows the clustering analysis of the 300 topics obtained from the “Arts” category
group, with 50 topics obtained independently from each of its six subcategories. The top row
of the image shows the result of k-means clustering, where k for k-means was chosen to be 6.
The leftmost color bar shows the cluster assignment of each point, where the cluster number
is encoded with a unique color shown in the color bar next to it; for example, cluster number 1
is dark blue and cluster number 5 is orange. The middle figure shows the cluster numbers
sorted in increasing order, thus showing the size of each cluster. We see that the dark blue
cluster contains a little fewer than 50 points, whereas the yellow cluster contains approximately
75 points. We also notice a pathological case of k-means where cluster number 6 (maroon)
contains only one point. The rightmost figure in the top row shows the actual composition of
each cluster obtained. This figure can be matched with the middle figure: each cluster is
overlaid with the actual color-coded 300 points. So we see that most of the dark blue points
that make up topics 1 to 50 are in cluster 3 (magenta), with some in the yellow cluster.
Ideally, we would want to see exactly one color shade in each cluster and each cluster to be
about 50 points in size. However, as noted earlier, there is bound to be some overlap between
the subtopics of the categories, and there are some subtopics that represent background or
general subtopics that may be common to all categories. In addition to this, the k-means
algorithm is susceptible to finding a suboptimal clustering, such as one that contains a cluster
with only a single point. The k-means analysis is nevertheless encouraging, as it shows a good
separation among all 6 categories in this high-dimensional space.

To augment the k-means analysis, we also produce two hierarchical clusterings of the topics
using the Euclidean and cosine distance measures. The hierarchical clustering trees are shown
in the bottom row of Figure 7.6. Individual points representing topics on the x-axis are labeled
using a single character such as “|” or “)” to save space. Each label corresponds to a single
category; for example, the vertical bar, “|”, is the label for “music”, so all the topics obtained
from the category music are labeled with the vertical bar. What is important to note in the
figure is that the same labels are clustered together, with some patches of mixed area. For
example, we see a cluster of “]” on the right of the tree generated using the cosine measure and
on the left of the tree generated using the Euclidean measure.

Figure 7.7 shows the same clustering analysis for the Science category group, with 100 topics
obtained from each of its categories. The Science category group contains six subcategories,
giving us a total of 600 topics. Figure 7.7 is laid out in the same manner as Figure 7.6. We can
see 6 clusters in the hierarchical trees. We also see that one of the leaves is separated far from
all the other points, thus giving rise to a singleton cluster in the k-means analysis.
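
The cluster composition read off the figures above can also be tabulated directly from the cluster assignments. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the topics are assumed to be ordered so that topic j comes from category `j // topics_per_category`, and the purity score is an added summary statistic that the text itself does not report.

```python
from collections import Counter


def cluster_composition(assignments, topics_per_category):
    """Map each cluster id to a Counter of the source categories of its topics."""
    composition = {}
    for topic_index, cluster in enumerate(assignments):
        category = topic_index // topics_per_category
        composition.setdefault(cluster, Counter())[category] += 1
    return composition


def purity(composition):
    """Fraction of topics that fall in the majority category of their cluster."""
    total = sum(sum(counts.values()) for counts in composition.values())
    majority = sum(max(counts.values()) for counts in composition.values())
    return majority / total


# Toy example: 3 categories with 4 topics each, clustered into 3 clusters.
assignments = [0, 0, 0, 1,  1, 1, 1, 1,  2, 2, 2, 0]
composition = cluster_composition(assignments, topics_per_category=4)
score = purity(composition)
```

In the ideal case described in the text, every cluster would contain topics from exactly one category and the purity would be 1; a singleton cluster shows up as a Counter with a single entry of count 1.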

We ran these experiments on all of our data sets with the number of topics ranging from 20 to
100 per category. A few more of these graphs for the domain pairs Wiki-Newsgroups and
NYT-Newsgroups are included in the Appendix. In each experiment, the hierarchical clustering
trees and k-means graphs show some separation among the clusters based on the underlying
categories. The presence of these clusters, separated according to category, is an encouraging
result that suggests that using the topics (β vectors) directly for classification may be useful.
In the next section, we discuss in detail the distance metrics and other formulations for
cross-domain classification using these topic vectors.

7.4 Using Topics (β’s) for Cross-Domain Document Classification

Earlier, we saw that using γ vectors (topic distribution vectors) does not give a higher
cross-domain classification accuracy than just comparing the word count vectors of the
documents. This may be due to the fact that, while using the γ vectors, we still essentially
compare the documents from the source domain to the documents in the target domain, albeit
with the documents represented in terms of topics instead of words. To achieve a higher
accuracy, we should not compare the documents in the target domain directly to the documents
in the source domain, but instead compare the target domain documents to more generalized
knowledge obtained from the source domain. This generalized knowledge, in our case, is given
by the set of multinomial distributions that represents a certain category.

This concept of using the generalized distribution of the categories led us to do the clustering
analysis on the topic vectors shown in the previous section. The results of the clustering
analysis show some separation among the topics obtained from different categories. These
preliminary results support our hypothesis that the LDA topic model run on individual categories
produces a set of unique subtopics that capture the distinctive word distribution of the category.

In this section, we will develop formulations for classification and distance metrics that can be
applied to the topics obtained using the LDA model. The probability of a document (represented
as a word vector) given α and β, $p(\mathbf{w} \mid \alpha, \beta)$, is computed by integrating the
probability of w given θ and β (Equation 7.3) over all possible values of θ given the Dirichlet
parameter α, where θ is a draw from the Dirichlet distribution parameterized by α.

$$
p(\mathbf{w} \mid \theta, \beta) = \prod_{n=1}^{N} \sum_{i=1}^{K} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \tag{7.3}
$$


According to the notation used by Blei et al. [11], $\beta_{ij}$ is the jth element of the ith topic,
and each word $w_n$ is a unit-basis V-dimensional vector that has exactly one component equal to
1 and all others equal to 0. Since $w_n^j$ is equal to 1 only when j is the vocabulary index of the
nth word (written j = n), the product $\prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j}$ reduces to
$\theta_i \beta_{i,n}$, where $\beta_{i,n}$ is the nth element of the ith β, or $p(w_n \mid \beta_i)$.
In our convention, we use an extra comma in the subscript of β to denote the words within a topic;
for example, the jth element of the ith topic is written as $\beta_{i,j}$ instead of $\beta_{ij}$.
Equation 7.4 gives us the short formulation to compute the probability of a word vector given θ.

$$
p(\mathbf{w} \mid \theta, \beta) = \prod_{n=1}^{N} \sum_{i=1}^{K} \theta_i \beta_{i,n} \tag{7.4}
$$
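
For a fixed, known θ, the probability in Equation 7.4 is straightforward to evaluate. The sketch below computes its logarithm to avoid numerical underflow on long documents. The function name and the representation of a document as a list of vocabulary indices are assumptions made for illustration.

```python
import math


def log_likelihood(word_ids, theta, beta):
    """log p(w | theta, beta) = sum over words n of log(sum_i theta_i * beta[i][w_n])."""
    total = 0.0
    for word in word_ids:
        total += math.log(sum(t * topic[word] for t, topic in zip(theta, beta)))
    return total


# Two topics over a vocabulary of three words.
beta = [[0.7, 0.2, 0.1],
        [0.1, 0.2, 0.7]]
theta = [0.5, 0.5]
document = [0, 0, 2]  # the document consists of word 0, word 0, word 2
value = log_likelihood(document, theta, beta)
```

In this toy case each word has mixture probability 0.4, so the result is three times the logarithm of 0.4.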

Equation 7.4 gives us a simplified way to compare two documents under LDA if we assume
that θ is not a random variable but a fixed known value. However, θ, under LDA, is a draw
from a Dirichlet distribution parameterized by α and is thus a random variable. We only
know the distribution of θ under the LDA model, not its exact value. In order to evaluate
the probability of a word vector given only α and β, we have to use Equation 7.1.
However, once we obtain the set of k topics from each category, various formulations
of an effective distance metric are possible that do not require inferring the value of α or θ
(or their variational approximations) using the EM algorithm. We present these
formulations and give the empirical results in the following subsections.

Figure 7.8 shows how a document from the test set (taken from the target domain) is compared
with the topics obtained from the source domain. As the dimensionality and the vocabulary
used for the documents are the same as those of the topics, they can be compared directly using
any standard distance measure. In the next section, we introduce different formulations (based
on properties of topic models and distributions) for comparing topics with documents, and we
show the corresponding empirical results. We run Wilcoxon rank-sum tests to assess the
statistical significance of our results [74]. These results are shown in Appendix A of this
dissertation.

7.4.1  Document  Topic  Comparison  Formulations 

In this subsection, we will introduce four different formulations for classifying the target
domain documents by using the topics from the source domain. As we shall see, these
formulations are developed with different assumptions about the underlying model for generating
the target domain documents using the source domain topics. We will also give a few different
interpretations of these formulations, whenever appropriate.

Formulation  1:  Using  Single  Topics  To  Generate  Target  Domain  Documents 

In our first formulation of a cross-domain document classifier based on the source domain
topics, we assume that each document is generated from only one topic. This, in a way, is the
simplest formulation of the target domain documents in terms of the source domain topics. To
use this formulation, where a single topic is chosen to be responsible for generating a given
document, we need distance metrics that are appropriate for comparing a topic, represented as
a multinomial distribution over the entire vocabulary, with a document, represented as a word
count vector. We concentrate on two main distance measures, namely the Kullback-Leibler
distance (KL distance) and the cosine distance measure.

Another way to interpret the assumption that a single topic from the source domain generates
a target domain document, using the LDA framework, is to assume that the value of α is
significantly less than 1 and close to 0. The Dirichlet distribution with the α parameter
approaching zero gets concentrated on the corners of the simplex. Samples from such a
distribution will contain all the probability mass of the sampled multinomial distribution
on a single value in its parameter vector. So the parameter vector of the sampled multinomial
distribution has a value of 1 in exactly one place and 0’s in all other places. Even though
the constraint of α being close to zero (assigning all the words to one topic) may seem too
limiting at first glance, in practice α is in fact very small, and about 90% of the document
is assigned to 5% of the topics.
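
The concentration of the Dirichlet distribution on the corners of the simplex can be checked with a short simulation. The sketch below draws symmetric Dirichlet samples by normalizing independent gamma draws, which is a standard construction; the function names, the number of topics, and the two α values are hypothetical choices made for illustration.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric k-dimensional Dirichlet(alpha)."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(draws)
        if total > 0.0:  # guard against floating-point underflow for tiny alpha
            return [d / total for d in draws]


def mean_largest_component(alpha, k, num_samples, rng):
    """Average, over many samples, of the largest topic proportion."""
    return sum(max(sample_dirichlet(alpha, k, rng))
               for _ in range(num_samples)) / num_samples


rng = random.Random(0)
sparse = mean_largest_component(alpha=0.01, k=10, num_samples=500, rng=rng)
dense = mean_largest_component(alpha=10.0, k=10, num_samples=500, rng=rng)
```

With α close to 0, the largest component is close to 1 on average, meaning that nearly all the mass sits on one topic; with a large α, the mass is spread almost evenly over the topics.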

We derive the likelihood expression for a document under this formulation using the following
assumption about the value of α under the LDA model.

$$
\alpha \ll 1 \tag{7.5}
$$

The LDA model, in this case, approaches the mixture of multinomial distributions model, where
all the words are assigned to a single topic. We can start with our simplified LDA probability
computation when we are given a value of θ as follows:

$$
p(\mathbf{w} \mid \theta, \beta) = \prod_{n=1}^{N} \sum_{i=1}^{K} \theta_i \beta_{i,n} \tag{7.6}
$$

We know that $\theta_i = 1$ for the single generating topic i and $\theta_j = 0$ for all $j \neq i$.
Using this, we obtain the following reduced expression for the probability of the document.


$$
p(\mathbf{w} \mid \beta_i) = \prod_{n=1}^{N} \beta_{i,n} \tag{7.7}
$$


In Equation 7.7, we obtain the likelihood expression for the document, w, given a topic under
this formulation. We will use the KL distance to compare the target domain documents with the
source domain topics. We will lay out the relationship between the likelihood expression and
the KL distance. This relationship will make it easy for us to develop all the distance metric
expressions in this section that we use for cross-domain classification. The KL distance between
two probability distributions P and Q is given as follows:

$$
D_{KL}(P \| Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \tag{7.8}
$$
We have shown earlier in Chapter 3 that the KL distance (also known as the KL divergence) is the
natural distance for multinomial distributions. It is an appropriate distance for comparing two
distributions under the framework of information theory, as it is the expected number of extra
bits needed to encode a message using distribution Q when that message is actually
distributed according to P.

In this section, we show a relationship between the likelihood expression of a document
given a topic, $\beta_i$, and the KL distance between the document and the topic $\beta_i$. This
motivates our choice of the KL distance as a substitute for computing the probability of the word
vector as shown earlier. Distribution Q is the maximum likelihood estimate for points generated
by the true distribution, P, when the KL distance between P and Q is minimum [75, 76].

Let w be the set of words in a document, where $w_n$ represents the count of the nth word, and
let the probability of w be given as follows:

$$
p(\mathbf{w} \mid \beta_i) = \prod_{n \in \mathbf{w}} \beta_{i,n}^{w_n} \tag{7.9}
$$

Then we obtain the following relationship between the likelihood and the KL distance:

$$
\arg\max_{\beta_i} \; p(\mathbf{w} \mid \beta_i) = \arg\min_{\beta_i} \; KL(\mathbf{w} \| \beta_i) \tag{7.10}
$$

Proof: Taking the log of both sides of Equation 7.9, we get an expression for the log-likelihood.

$$
\log p(\mathbf{w} \mid \beta_i) = \sum_{n \in \mathbf{w}} w_n \log \beta_{i,n} \tag{7.11}
$$

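We show that the KL distance between w and $\beta_i$, $KL(\mathbf{w} \| \beta_i)$, is minimum for the
$\beta_i$ that maximizes the above likelihood expression.

$$
KL(\mathbf{w} \| \beta_i) = \sum_{n \in \mathbf{w}} w_n \log\left(\frac{w_n}{\beta_{i,n}}\right) \tag{7.12}
$$

$$
= \sum_{n \in \mathbf{w}} \left( w_n \log w_n - w_n \log \beta_{i,n} \right) \tag{7.13}
$$

From the last equation, we see that, since the first term does not depend on the topic, the
expression $KL(\mathbf{w} \| \beta_i)$ is minimum when $\sum_{n \in \mathbf{w}} w_n \log \beta_{i,n}$
is maximum.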
Using the result obtained in Equation 7.10, we use the KL distance to compare the documents
with the topics using the KNN classifier. We compute the nearest neighbors of each document
in the set of all k × c topics obtained using LDA as described earlier. The number k for
the k-nearest neighbor classifier is chosen to be the same as k, the number of topics obtained
from each category.
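
The nearest-topic classifier described above can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function names are hypothetical, documents are assumed to be word count vectors over the same vocabulary as the topics, topics are assumed to be strictly positive wherever the document has a non-zero count, and ties in the vote are broken by whichever label is encountered first. The sketch also includes the log-likelihood of Equation 7.11 so that the relationship in Equation 7.10 can be checked numerically.

```python
import math
from collections import Counter


def log_likelihood(counts, topic):
    """Log-likelihood of a count vector under a single topic (Equation 7.11)."""
    return sum(c * math.log(topic[n]) for n, c in enumerate(counts) if c > 0)


def kl_distance(counts, topic):
    """KL distance between the normalized word counts and a topic."""
    total = sum(counts)
    return sum((c / total) * math.log((c / total) / topic[n])
               for n, c in enumerate(counts) if c > 0)


def classify_with_topics(counts, topics, topic_labels, k):
    """Vote among the k topics that are nearest to the document in KL distance."""
    ranked = sorted(range(len(topics)), key=lambda i: kl_distance(counts, topics[i]))
    votes = Counter(topic_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy example: c = 2 categories, k = 2 topics each, vocabulary of 3 words.
topics = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1],   # topics from category 0
          [0.1, 0.2, 0.7], [0.2, 0.2, 0.6]]   # topics from category 1
topic_labels = [0, 0, 1, 1]
document = [1, 2, 6]

predicted = classify_with_topics(document, topics, topic_labels, k=2)
nearest = min(range(len(topics)), key=lambda i: kl_distance(document, topics[i]))
most_likely = max(range(len(topics)), key=lambda i: log_likelihood(document, topics[i]))
```

Normalizing the counts rescales the KL distance by a positive constant and shifts it by a term that does not depend on the topic, so the nearest topic is also the maximum-likelihood topic, as Equation 7.10 states.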

We show an illustration of this process in Figure 7.9, which shows the topics and the document
as points distributed on a word simplex. A document with normalized word counts is compared
to the topics using the KL distance.

Figure 7.10 shows the results of our cross-domain algorithm based on the KL distance
for comparing the documents to the source domain topics. The figure shows three bars: (1) the
blue bar is cross-domain classification using conventional KNN on word count vectors with the
cosine distance measure; (2) the green bar is our algorithm, where the target domain documents
are compared with the topics obtained from the source domain using a KNN classifier with the
KL distance; and (3) the red bar is single-domain classification, where the target and source
domains are the same and classification is done using KNN on word count vectors with the cosine
distance measure. We compare the conventional method (blue bar) with our method (green bar) and
observe that the k-nearest neighbor classifier with the topics used as the training set (LDA-KL)
outperforms the k-nearest neighbor classifier using the word count vectors (KNN-Cosine). The
results are consistent across all three pairs of domains (Wikipedia, NYT and Newsgroups), with
each domain serving as both a source and a target domain.

For the second distance measure, we consider the cosine distance measure, which can be described
as $1 - \cos(\mathbf{w}, \beta_i)$. The motivation for this distance measure comes from the fact that
each topic can itself be treated as a normalized word count vector. Since the cosine measure is
commonly used to compare two documents, we can use the same measure to compare a document
with a topic.

Figure 7.11 compares the cross-domain classification accuracy obtained from the two distance
measures, cosine and KL distance, that we propose for comparing the target domain documents
to the topic vectors obtained from the source domain categories. The graphs show results for all
the category groups explained in Chapter 4, such as “Arts” and “Computers.” The blue
bar in the figure is the accuracy obtained using the cosine measure and the yellow bar is the
accuracy obtained using the KL distance. Although both measures have valid reasons to be
used, we see that the KL distance measure outperforms the cosine distance measure (the yellow
bar is higher than the blue bar) in most cases. Figure 7.11 shows the results for all three pairs
of domains (Wikipedia, NYT and Newsgroups), with each domain serving as both a source and a
target domain.

Formulation  2:  Using  Likelihoods  Computed  by  LDA  and  PLSA  Models 

In our second formulation, we assume that the target domain documents are generated
by the LDA model and the PLSA model given the complete set of β’s obtained from a given
source domain category. For LDA, this case assumes a more realistic value of α that is less than
one but not close to 0. This results in a distribution that assigns the words in the document to
three to five different topics.

When using a value of α less than 1, we get Equation 7.14 for the probability, where a few values
of the θ vector are non-zero. This is the equation used in the PLSA topic model, as PLSA does not
have a Dirichlet prior over θ. To compute the value of θ that maximizes the log-likelihood in
Equation 7.14, we have to solve an equation that involves a logarithm of sums.

$$
p(\mathbf{w} \mid \theta, \beta) = \prod_{n=1}^{N} \sum_{i=1}^{K} \theta_i \beta_{i,n} \tag{7.14}
$$

In the case of PLSA, we use Jensen’s inequality and the EM algorithm to maximize the log
likelihood for each individual set of β's, alternating between an E-step and an M-step. For LDA,
we use the variational approximation of the parameters, following the formulation laid out in the
paper by Blei et al. [11].
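The PLSA case can be sketched in Python as follows. This is a minimal EM loop that re-estimates only θ for one document while the β vectors stay fixed; it works with word counts rather than word positions, which gives the same likelihood as Equation 7.14. The function names, the iteration count, and the toy β values are assumptions for illustration, not the implementation used in the experiments.

```python
import math

def log_likelihood(doc_counts, beta, theta):
    """Log of Equation 7.14, written over word counts instead of word positions."""
    K = len(beta)
    return sum(
        count * math.log(sum(theta[i] * beta[i][v] for i in range(K)))
        for v, count in enumerate(doc_counts) if count > 0
    )

def fold_in_theta(doc_counts, beta, iterations=50):
    """EM estimate of the topic weights theta for one document, with beta held fixed."""
    K = len(beta)
    theta = [1.0 / K] * K  # start from uniform weights
    for _ in range(iterations):
        expected = [0.0] * K
        for v, count in enumerate(doc_counts):
            if count == 0:
                continue
            # E-step: responsibility of each topic for vocabulary word v
            joint = [theta[i] * beta[i][v] for i in range(K)]
            norm = sum(joint)
            for i in range(K):
                expected[i] += count * joint[i] / norm
        # M-step: renormalize the expected topic counts
        total = sum(expected)
        theta = [e / total for e in expected]
    return theta

# Hypothetical two-topic model over a three-word vocabulary.
toy_beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
toy_doc = [8, 2, 0]
fitted = fold_in_theta(toy_doc, toy_beta)
print(fitted, log_likelihood(toy_doc, toy_beta, fitted))
```

Because θ is refitted for every candidate set of β's, each category gets its best-case likelihood for the document, which is the flexibility discussed in this section.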

The empirical results for this formulation are shown in Figure 7.12.

The empirical results do not show an improvement in classification accuracy over conventional
algorithms. This is not surprising, as using the EM algorithm and the PLSA model to adjust the
θ vector for each set of β's introduces too much flexibility in the likelihood expression. In other
words, each set of β vectors contains enough combinations of topics to generate high likelihoods,
thus reducing the effect of the individual topics in the set. This result is similar to the result
obtained when the entire LDA model is used to classify the documents.

It is also not surprising that LDA gives lower classification accuracy than PLSA when the
topic distribution of each document is fitted for each set of β vectors independently. In the
PLSA model, the EM algorithm is used to obtain the empirical θ based only on the words in the
document, without a Dirichlet prior. The LDA model is more flexible than the PLSA model, as
it only assigns a probability distribution over the topics. In other words, the more flexibly a
model adapts to a document, the lower its classification accuracy, because it is flexible enough
to assign a high likelihood to previously unseen documents. This, however, should not be seen
as a disadvantage of LDA or PLSA; on the contrary, this flexibility is a key aspect of topic
models. The documents that we are trying to classify here belong to one category group and
thus to similar categories. This similarity in the categories generates topics that overlap
considerably. For example, out of the 20 topics obtained from the Music and Opera categories,
5 to 10 topics may be similar across the two categories. That is to say, a topic that puts a high
probability on words such as “performance,” “singing,” etc. will be present in both categories.
If we fit a document belonging to the category “Opera” according to the topics of the category
“Music,” the topic model algorithm will be able to find a combination of a small subset of
topics from the Music category that gives a high likelihood to this document from the Opera
category. This fine tuning of the weights on different topics for each document impedes the
differentiating ability of the sets of topics obtained from different categories of the source
domain.

Our next two formulations use the entire set of topics to generate the document, where each
topic in the set is weighted equally. We propose two different ways of using the entire set of
topics for generating the document: (1) using the LDA framework and (2) using the mixture of
multinomial framework.


Formulations  3  and  4:  Using  All  Topics  Equally  To  Generate  Target  Domain  Documents 

In our third and fourth formulations for a cross-domain document classifier based on the source
domain topics, we assume that each document is generated from all the topics obtained from a
source domain category, where each topic is equally weighted by the generating topic model.
We develop two ways of generating the target domain documents using the source domain
topics: (1) using the LDA model and (2) using the mixture of multinomial model. Each
formulation produces a slightly different KL-distance based metric. We derive the metrics, give
empirical results and show a geometric interpretation of both of these formulations in this
section.

Formulation 3: Using All Topics Equally Under LDA Framework

We develop our third formulation based on the LDA framework and use all topics equally to
generate the target domain documents. One way to interpret this, in terms of α, is by assuming
that the value of α is significantly larger than 1. A symmetric Dirichlet distribution with an α
parameter much larger than one concentrates its mass around the uniform topic proportion
vector, so that every topic contributes almost equally to the document.

A large value of α makes the document equally distributed among all the topics. On the simplex
of topics, a large α puts almost all the probability in the center of the simplex. We derive the
likelihood expression for a document under this formulation using the following assumption
about the value of α under the LDA model.


$$\alpha \gg 1 \qquad (7.15)$$
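The effect of the size of α can be checked numerically. The Python sketch below draws topic proportion vectors from a symmetric Dirichlet distribution (via normalized Gamma variates from the standard library) and measures how far, on average, the draws stray from the uniform vector 1/K. The number of topics, the sample size, the seed, and the helper names are arbitrary assumptions for the illustration.

```python
import random

def sample_dirichlet(alpha, K, rng):
    """Draw one vector from a symmetric Dirichlet(alpha) using normalized Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_max_deviation(alpha, K=5, samples=2000, seed=0):
    """Average over many draws of max_i |theta_i - 1/K| (0 means exactly uniform)."""
    rng = random.Random(seed)
    accumulated = 0.0
    for _ in range(samples):
        theta = sample_dirichlet(alpha, K, rng)
        accumulated += max(abs(t - 1.0 / K) for t in theta)
    return accumulated / samples

# Small alpha: mass near the corners; alpha = 1: uniform over the simplex;
# large alpha: mass near the center of the simplex.
for alpha in (0.1, 1.0, 100.0):
    print(alpha, round(mean_max_deviation(alpha), 3))
```

The deviation shrinks as α grows, which is the behavior the assumption in Equation 7.15 relies on.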

Starting with our original simplified expression for the likelihood under the LDA model, we
have

$$p(w \mid \theta, \beta) = \prod_{n=1}^{N} \sum_{i=1}^{K} \theta_i \, \beta_{i, w_n} \qquad (7.16)$$

Since the assumed value of α is much greater than 1, we know that θ_i = θ = 1/K for all
i ∈ {1, ..., K}. Using this we get,

$$p(w \mid \theta, \beta) = \prod_{n=1}^{N} \frac{1}{K} \sum_{i=1}^{K} \beta_{i, w_n} \qquad (7.17)$$


Using our word vector notation, where w is the set of words in a document and w_n represents
the count of the nth word, we get the following expression for the probability of w:

$$p(w \mid \theta, \beta) = \prod_{n} \left( \frac{1}{K} \sum_{i=1}^{K} \beta_{i, n} \right)^{w_n} \qquad (7.18)$$


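Taking the log of both sides,

$$\log\left(p(w \mid \theta, \beta)\right) = \sum_{n} w_n \log \left( \frac{1}{K} \sum_{i=1}^{K} \beta_{i, n} \right) \qquad (7.19)$$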


Using the result shown in Equation 7.10, for a given document w, we get the following
expression,

$$\operatorname*{argmax}_{\beta} \; p(w \mid \beta) = \operatorname*{argmin}_{\beta} \; KL\left( w \,\Big\|\, \frac{1}{K} \sum_{i=1}^{K} \beta_i \right) \qquad (7.20)$$

We can also write this as follows, where $\bar{\beta}$ is the mean of the β_i:

$$\operatorname*{argmax}_{\beta} \; p(w \mid \beta) = \operatorname*{argmin}_{\beta} \; KL\left( w \,\|\, \bar{\beta} \right) \qquad (7.21)$$


Given the result in Equation 7.21, the distance metric that computes the distance between a
document and a set of source domain topics is the KL-distance between the mean of the topics
and the normalized word count vector for the document. This is the centroid based distance
used in hierarchical clustering. Figure 7.13 shows an illustration of this metric on the word
simplex. The figure shows the distance from a document to a set of topics, which is computed as
the KL distance between the document and μ_β, where μ_β is the mean of all the topics obtained
from a single source domain category.

The empirical results for this formulation are shown in Figure 7.14.
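The metric of Equation 7.21 can be sketched in Python as follows. The sketch classifies a document by the KL-distance to the mean of each category's topics and, for comparison, by the log likelihood of Equation 7.19. Since the two differ only by a document-dependent constant and a positive scale factor, they pick the same category. The function names and the toy topics are assumptions for illustration, and all topic entries are taken to be strictly positive.

```python
import math

def normalize(counts):
    """Convert raw word counts into a probability vector."""
    total = float(sum(counts))
    return [c / total for c in counts]

def kl(p, q):
    """KL(p || q), assuming every entry of q is strictly positive."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def mean_topic(topics):
    """Centroid of a set of topic vectors (the mean of the beta vectors)."""
    K = len(topics)
    return [sum(t[v] for t in topics) / K for v in range(len(topics[0]))]

def log_likelihood_equal_topics(doc_counts, topics):
    """Equation 7.19: every word is drawn from the average of the K topics."""
    avg = mean_topic(topics)
    return sum(c * math.log(avg[v]) for v, c in enumerate(doc_counts) if c > 0)

def classify_by_centroid(doc_counts, topics_by_category):
    """Pick the category whose topic centroid is closest in KL-distance."""
    doc = normalize(doc_counts)
    return min(topics_by_category,
               key=lambda cat: kl(doc, mean_topic(topics_by_category[cat])))

def classify_by_likelihood(doc_counts, topics_by_category):
    """Pick the category under which the document has the highest log likelihood."""
    return max(topics_by_category,
               key=lambda cat: log_likelihood_equal_topics(doc_counts, topics_by_category[cat]))

# Hypothetical topics over a three-word vocabulary.
toy_topics = {
    "Music": [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]],
    "Opera": [[0.1, 0.3, 0.6], [0.2, 0.2, 0.6]],
}
for doc in ([1, 2, 7], [7, 2, 1], [3, 3, 4]):
    print(doc, classify_by_centroid(doc, toy_topics), classify_by_likelihood(doc, toy_topics))
```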


Formulation  4:  Using  All  Topics  Under  Mixture  of  Multinomial  Framework 

In formulation 3, described in the previous section, we use the LDA framework to generate the
documents assuming a large value of α. Under the LDA framework, different words in the
document can be assigned to different topics. Thus, the likelihood expression is a product of
sums, where a topic is chosen independently for each word and the word is then drawn from
that topic.

In our fourth formulation, we choose the mixture of multinomial framework, where all the
words in a document are assigned to the same topic, but the document can be generated using
any one of the topics with equal probability. The likelihood expression of the document under
this model is given as follows:


$$p(w \mid \theta, \beta) = \sum_{i=1}^{K} \theta_i \prod_{n=1}^{N} \beta_{i, w_n} \qquad (7.22)$$

where θ is the vector of weights given to the topics, that is, θ_i is the probability that topic i
is the topic generating the document. Using our assumption of a uniform θ, we obtain the
following simplification of this likelihood expression:

$$p(w \mid \theta, \beta) = \frac{1}{K} \sum_{i=1}^{K} \prod_{n=1}^{N} \beta_{i, w_n} \qquad (7.23)$$


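Taking the log of both sides to compute the expression for the log likelihood, we obtain the
following, where the constant C = -log K does not depend on β:

$$\log\left(p(w \mid \theta, \beta)\right) = C + \log \left( \sum_{i=1}^{K} \prod_{n=1}^{N} \beta_{i, w_n} \right) \qquad (7.24)$$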
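For documents of realistic length, the products in Equation 7.24 underflow in floating point, so the log likelihood is usually evaluated with the log-sum-exp trick. The following Python sketch shows one way to do this for the equally weighted mixture of multinomial model; the function names and the toy topics are assumptions for illustration.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mixture_log_likelihood(doc_counts, topics):
    """Equation 7.24 with C = -log K: log of the equally weighted mixture likelihood."""
    K = len(topics)
    # log of prod_n beta_{i, w_n} for each topic i, computed from word counts
    per_topic = [
        sum(c * math.log(t[v]) for v, c in enumerate(doc_counts) if c > 0)
        for t in topics
    ]
    return -math.log(K) + log_sum_exp(per_topic)

# Hypothetical topics over a three-word vocabulary.
toy_topics = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
print(mixture_log_likelihood([2, 1, 0], toy_topics))
print(mixture_log_likelihood([5000, 3000, 2000], toy_topics))  # stays finite
```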


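Using Jensen’s inequality within the EM-algorithm framework, by introducing a new
distribution q(θ), we obtain the following result:

$$
\begin{aligned}
\log\left(p(w \mid \theta, \beta)\right) &= C + \log \sum_{i=1}^{K} q(\theta_i) \, \frac{\prod_{n=1}^{N} \beta_{i, w_n}}{q(\theta_i)} && (7.25) \\
&\geq C + \sum_{i=1}^{K} q(\theta_i) \log \left( \frac{\prod_{n=1}^{N} \beta_{i, w_n}}{q(\theta_i)} \right) && (7.26) \\
&= C - \sum_{i=1}^{K} KL\left( q(\theta_i) \,\Big\|\, \prod_{n=1}^{N} \beta_{i, w_n} \right) && (7.27) \\
&= C - \sum_{i=1}^{K} q(\theta_i) \left( \log q(\theta_i) - \log \prod_{n=1}^{N} \beta_{i, w_n} \right) && (7.28)
\end{aligned}
$$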
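The inequality in Equation 7.26 can be checked numerically. The Python sketch below evaluates the exact log likelihood of Equation 7.24 and the Jensen lower bound for a distribution q over the K topics. It also shows the standard EM fact that the bound becomes tight when q is the posterior over topics. The function names and the toy values are assumptions for illustration.

```python
import math

def topic_log_products(doc_counts, topics):
    """log of prod_n beta_{i, w_n} for each topic i, computed from word counts."""
    return [sum(c * math.log(t[v]) for v, c in enumerate(doc_counts) if c > 0)
            for t in topics]

def exact_log_likelihood(doc_counts, topics):
    """Equation 7.24 with C = -log K, evaluated with the log-sum-exp trick."""
    logs = topic_log_products(doc_counts, topics)
    m = max(logs)
    return -math.log(len(topics)) + m + math.log(sum(math.exp(x - m) for x in logs))

def jensen_lower_bound(doc_counts, topics, q):
    """Right-hand side of Equation 7.26 for a distribution q over the topics."""
    logs = topic_log_products(doc_counts, topics)
    bound = -math.log(len(topics))  # the constant C
    for q_i, log_prod in zip(q, logs):
        if q_i > 0:
            bound += q_i * (log_prod - math.log(q_i))
    return bound

def posterior_q(doc_counts, topics):
    """Posterior probability of each topic given the document (uniform prior)."""
    logs = topic_log_products(doc_counts, topics)
    m = max(logs)
    weights = [math.exp(x - m) for x in logs]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical topics over a three-word vocabulary.
toy_topics = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
toy_doc = [2, 1, 3]
exact = exact_log_likelihood(toy_doc, toy_topics)
loose = jensen_lower_bound(toy_doc, toy_topics, [0.5, 0.5])
tight = jensen_lower_bound(toy_doc, toy_topics, posterior_q(toy_doc, toy_topics))
print(exact, loose, tight)
```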


We obtain Equation 7.28 by using the definition of KL-divergence given in Equation 3.40. In
Equation 7.28, we observe that only one term contains β_i, and to minimize this expression we
can use our earlier result, Equation 7.10, which relates the likelihood to the KL distance. In
this case, to maximize the likelihood of a document under a mixture of multinomial model in
which all the multinomial distributions (β) are equally weighted (uniform θ vector), we
minimize the sum of the KL distances between the document and the multinomial
distributions. The multinomial distributions are the topics obtained from the source domain
categories. We obtain the following result:


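$$\operatorname*{argmax}_{\beta} \; p(w \mid \beta) = \operatorname*{argmin}_{\beta} \; \sum_{i=1}^{K} KL\left( w \,\|\, \beta_i \right) \qquad (7.29)$$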


Figure 7.15 shows an illustration of this metric on the word simplex. The figure shows that the
distance from a document to a set of topics is computed as the sum of the KL distances between
the document and all the topics obtained from a single source domain category.

The empirical results for this formulation are shown in Figure 7.16.
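A minimal Python sketch of the metric in Equation 7.29 follows. It sums the KL-distances from the normalized document to every topic of a category and picks the category with the smallest sum. The comparison is only meaningful when every category contributes the same number of topics, as in the experiments here. The function names and the toy topics are assumptions for illustration, and topic entries are taken to be strictly positive.

```python
import math

def normalize(counts):
    """Convert raw word counts into a probability vector."""
    total = float(sum(counts))
    return [c / total for c in counts]

def kl(p, q):
    """KL(p || q), assuming every entry of q is strictly positive."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def sum_kl_distance(doc_counts, topics):
    """Sum of KL-distances from the document to each topic in one category."""
    doc = normalize(doc_counts)
    return sum(kl(doc, topic) for topic in topics)

def classify_by_sum_kl(doc_counts, topics_by_category):
    """Pick the category with the smallest summed KL-distance (Equation 7.29)."""
    return min(topics_by_category,
               key=lambda cat: sum_kl_distance(doc_counts, topics_by_category[cat]))

# Hypothetical topics over a three-word vocabulary, two topics per category.
toy_topics = {
    "Music": [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]],
    "Opera": [[0.1, 0.3, 0.6], [0.2, 0.2, 0.6]],
}
print(classify_by_sum_kl([1, 2, 7], toy_topics))
print(classify_by_sum_kl([7, 2, 1], toy_topics))
```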

7.4.2 Using β Vectors for Same Domain Classification

We showed that our formulations (1), (3) and (4) give much better cross-domain classification
accuracy than conventional methods such as K-nearest neighbor using the word count feature
vector. In this section, we investigate how these formulations perform for same domain
classification. We carry out our experiments by splitting the documents in each domain into a
training set and a testing set, using 80% of the documents for training and 20% for testing. We
perform 5-fold cross validation using this 80-20 split. We train an LDA model on the training
set documents of each class independently, thus obtaining 20 β vectors for each class. We
then classify the held-out 20% of the documents using the obtained topic vectors with three of
our formulations. We see that a conventional KNN, trained on word count based feature
vectors, outperforms the LDA based KNN formulations in almost every case. This result is
expected, as a count based feature vector retains more information about the training set
documents and is better suited to train a classifier that is to be used for documents in the same
domain.

The empirical results for these experiments are shown in Figure 7.17.
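The 80-20 split with 5-fold cross validation can be sketched in Python as follows. This sketch only generates the document index sets for each fold; the shuffling, the seed, and the function name are assumptions for illustration, since the text does not specify how the folds were drawn.

```python
import random

def five_fold_splits(num_docs, folds=5, seed=0):
    """Yield (train, test) index lists; each fold holds out 1/folds of the documents."""
    indices = list(range(num_docs))
    random.Random(seed).shuffle(indices)  # the seed is an arbitrary, assumed choice
    for fold in range(folds):
        test = indices[fold::folds]       # every folds-th shuffled index
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

splits = list(five_fold_splits(100))
for train, test in splits:
    print(len(train), len(test))
```

Each document is held out exactly once, so averaging the per-fold accuracies uses every document for testing.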

7.4.3  Support  Vector  Machines  For  Topics 

In the previous sections, we discussed four different metrics based on different assumptions
about the model generating the target domain documents. These metrics provide a nearest
neighbor based classifier that can also be interpreted using hierarchical clustering analysis.
Instead of using only the nearest neighbor based approaches, we can also train a classifier on
the topics themselves. Based on the observation that the topics show a separation on the
simplex, a classifier such as a support vector machine (SVM) should also be able to find
meaningful boundaries on the simplex that separate topics obtained from different categories.
Figure 7.18 shows the results of using an SVM for cross-domain classification, where the blue
bar is the conventional SVM trained on the word count vectors and the green bar is the SVM
trained on the topics obtained using LDA.

The SVM trained on topics gives better cross-domain accuracy than the SVM classifier trained
on the conventional word count feature vectors. The separating planes found by the SVM
classifier are based on the distribution of the topics on the simplex. Even though the number of
topics is much smaller than the number of documents in the source domain, the separation
among these few topics provides a better separation among the categories that can be
transferred across different domains. Figure 7.19 shows an illustration of the SVM classifier
output on the word simplex.

As in the previous section, we can also use the SVM classifier on topics obtained by LDA for
single domain classification. Figure 7.20 shows the results of using the LDA SVM on the same
domain and compares it with the conventional SVM on a single domain. We see that a
conventional SVM, trained on word count based feature vectors, outperforms the LDA based
SVM. This result is expected, as a count based feature vector retains more information about
the training set documents and is better suited to train a classifier that is to be used for
documents in the same domain.

[Figure panels: (a) NYT to Wiki Classification; (b) Wiki to NYT Classification; (c) Newsgroups to Wiki Classification; (d) Wiki to Newsgroups Classification; (e) Newsgroups to NYT Classification; (f) NYT to Newsgroups Classification]


Figure 7.3: Comparison of four distance measures for cross-domain document classification using the topic
proportion (γ) vectors and a KNN classifier with k = 20. The dotted red line is the cross-domain classifier using the
conventional cosine distance metric. We see that, in general, using only the topic proportion vectors for documents
does not give higher accuracy than using the counts. This result is not surprising, as the topic proportion vectors
have less than 1% of the dimensions of the original word count vectors. Further, we observe that the KL-distance for
multinomial distributions performs best most of the time, while the other three distance measures, including the
Dir-KL distance (KL distance for Dirichlet distributions), all vary in performance.

[Figure panels: (e) Newsgroups to NYT Classification; (f) NYT to Newsgroups Classification]


Figure 7.4: Comparison of the SVM classifier using the topic proportion (γ) vectors and conventional word count
feature vectors. The blue bar is the cross-domain document classification accuracy using the word counts and the
red bar is the accuracy using the topic proportion feature vectors. We see that using conventional algorithms such as
KNN and SVMs on γ vectors does not give higher cross-domain document classification accuracy (the blue bar is
generally higher than the red bar).

[Figure panels: (a) NYT to Wiki Classification; (b) Wiki to NYT Classification; (c) Newsgroups to Wiki Classification; (d) Wiki to Newsgroups Classification; (e) Newsgroups to NYT Classification; (f) NYT to Newsgroups Classification]


Figure 7.5: Comparison of the KL-distance KNN classifier and the cosine distance KNN classifier on word counts
with our reduced topic proportion feature vector. Our reduced topic proportion vectors (PLSA Category Weight, LDA
Category Weight) are obtained by adding up the weights of all the topics obtained from the same category. We use
both the PLSA and LDA topic models to obtain the topic proportion vectors. A document is classified to the category
that has the most words assigned to it. We see that by using words assigned to the topics of a particular category, we
do get a higher accuracy. In most cases, we get a higher accuracy than the conventional cosine distance measure on
the word counts (the blue and green lines are higher than the red line).

[Figure panels: (a) K-means Clustering Analysis of Topics from Arts Category Group; (b) Hierarchical Clustering Tree, Cosine Distance; (c) Hierarchical Clustering Tree, Euclidean Distance]

Figure 7.6: Clustering analysis of the 300 topics obtained from the “Arts” category group, with 50 topics obtained
independently from each of its six subcategories. The leftmost color bar shows the cluster assignment of each of the
300 topics and the middle figure shows the clusters sorted in increasing order, where each cluster is encoded as
a unique color. The rightmost figure in the top row shows the actual composition of each cluster, where each
point (topic) is denoted by a unique color. The hierarchical clustering trees are shown in the bottom row. A single
character label corresponds to a single category. It is important to note that the same labels are clustered
together, with some patches of mixed area.

[Figure panels: (a) K-means Clustering Analysis of Topics from Science Category Group; (b) Hierarchical Clustering Tree, Cosine Distance; (c) Hierarchical Clustering Tree, Euclidean Distance]


Figure 7.7: Clustering analysis of the 600 topics obtained from the “Science” category group, with 100 topics
obtained independently from each of its six subcategories. The figure shows plots similar to those described in
Figure 7.6.

Figure 7.8: An illustration showing the set of β's learned from each category in the source domain independently.

Figure 7.9: An illustration for comparing documents to single topics on a word simplex. KL-distance, being the
natural distance measure on a simplex, gives us the best classification accuracy.

[Figure panel: (f) NYT to Newsgroups Classification]


Figure 7.10: Comparison of cross-domain classification using KNN-Cosine (blue) and our algorithm, KNN-Beta-KL
(green), with single domain classification (red). We see that our algorithm performs better than the conventional
algorithm (the green bar is higher than the blue bar) for cross-domain classification in most cases. The three rows of
the figure show three different pairs of domains used in our experiments.

Figure 7.11: Comparison of two distance measures, (1) KNN-Beta-Cosine (blue) and (2) KNN-Beta-KL (yellow), for
comparing the documents to the topics obtained from LDA. The KL distance performs better than the cosine distance
measure in most cases. The accuracies shown are for cross-domain classification. The three rows of the figure
show three different pairs of domains (Wikipedia, NYT and Newsgroups) used in our experiments.

Figure 7.12: Comparison of cross-domain classification using KNN-Cosine (dark blue) and our second formulation, in
which the documents are classified according to the likelihood given by an LDA (light blue) or PLSA (yellow) model.
The LDA and PLSA models are trained on source domain categories. We see that the second formulation of our topic
based algorithm does not perform better than the conventional algorithm (the dark blue bar is higher than the light
blue and yellow bars) for cross-domain classification. This is not unexpected, as fitting an LDA or PLSA model to a
new document gives it a lot of flexibility in choosing the topics that generate the document. As all the categories are
related, there is enough overlap between the topics to give the document a high likelihood under each model. The
three rows of the figure show three different pairs of domains used in our experiments.

Figure 7.13: A figure showing the distance from the document to the set of topics in a cluster, where the cluster
consists of all the topics obtained from one source domain category. The distance is computed using the
KL-distance between the document and the mean of the cluster. This distance metric, which we call “topic centroid
distance,” always gives us a higher accuracy for cross-domain document classification when compared with the
conventional word count distances and even when compared with our first formulation based on single topics.

Figure 7.14: Comparison of cross-domain classification using KNN-Cosine (blue) and our third formulation for the
topic based algorithm, KL-Mean-Distance (green), with single domain classification (red). We see that our algorithm
performs better than the conventional algorithm (the green bar is higher than the blue bar) for cross-domain
classification in all cases. The third formulation is based on an assumption that the target domain models are
generated using a mixture of multinomial model where all the components (topics) are equally weighted. In terms of
distance measures, it is the sum of the KL-distances between the document and all the topics. The three rows of the
figure show three different pairs of domains used in our experiments.

Figure 7.15: A figure showing the sum of all the distances from the document to the topics in a cluster, where the
cluster consists of all the topics in one class. This distance metric, which we call “cluster sum distance,” gives us an
accuracy that is always higher than when the document is classified using single topics, as discussed earlier.

Figure 7.16: Comparison of cross-domain classification using KNN-Cosine (blue) and our fourth formulation for the
topic based algorithm, KL-Mean-Distance (green), with single domain classification (red). We see that our algorithm
performs better than the conventional algorithm (the green bar is higher than the blue bar) for cross-domain
classification in all cases. The fourth formulation is based on an assumption that the target domain models are
generated using an LDA model with a very large α value that generates a uniform distribution over all the topics. In
terms of distance measures, we show that it is equivalent to the KL-distance between the document and the mean of
all the topics. The three rows of the figure show three different pairs of domains used in our experiments.

[Figure panel: (c) NYT to Wiki, Formulation 3]

Figure 7.17: Results of using the LDA K-nearest neighbor formulations on the same domain, compared with the
conventional K-nearest neighbor for the single domain classification task. The leftmost bar (dark blue) is the
accuracy obtained by using the conventional KNN classifier for cross-domain classification. The middle two bars
(light blue and light orange) are the cross-domain and same domain accuracies obtained by our LDA based KNN
classifier formulations. The rightmost bar (dark red) is the same domain accuracy obtained by the conventional KNN.
We see that a conventional KNN, trained on word count based feature vectors, outperforms the LDA based KNN
formulations for same domain classification (the orange bar is lower than the red bar). The three rows show the
graphs for the three different formulations (1), (3) and (4) as described in this chapter.

Figure 7.18: Comparison of cross-domain classification using a conventional SVM on word counts (blue) and our
approach of using SVM classification on the topics (green). We see that our method of training an SVM classifier
using only the topics performs better than the conventional method of SVMs trained on word count vectors (the green
bar is higher than the blue bar) for cross-domain classification in most cases. The three rows of figures show
three different pairs of domains (Wikipedia, NYT and Newsgroups) used in our experiments.

Figure 7.19: An illustration showing the output of the SVM classifier, when trained on topics, on a word simplex.
The SVM trained on topics gives better cross-domain accuracy than the SVM classifier trained on the conventional
word count feature vectors.

Figure 7.20: Results of using the LDA SVM on the same domain, compared with the conventional SVM on a single
domain. The leftmost bar (dark blue) is the accuracy obtained by using the conventional SVM classifier for
cross-domain classification. The middle two bars (light blue and light orange) are the cross-domain and same domain
accuracies obtained by our LDA based SVM classifier. The rightmost bar (dark red) is the same domain accuracy
obtained by the conventional SVM. We see that a conventional SVM, trained on word count based feature vectors,
outperforms the LDA based SVM for same domain classification (the orange bar is lower than the red bar).

Table 7.2: A summary of the accuracy obtained from our methods and its comparison with the conventional methods.
The table shows the results of four of our methods that show improvement over the conventional methods. The
KNN-KL, Sum-KL and Mean-KL headings correspond to the formulations (1), (3) and (4) respectively as described
in this chapter.


7.5  Conclusion 

In this chapter, we provided various formulations for cross-domain document classification
based on topic models. We analyzed the performance of different classifiers and the distribution
of the topics obtained from the source domain using clustering algorithms. We developed four
different distance metrics based on different assumptions about the relationship between the
target domain documents and the source domain topics. We showed that formulations (1), (3)
and (4) perform better than the conventional algorithms when used for cross-domain document
classification, while the likelihood based formulation (2) does not. We further showed that a
classifier such as an SVM trained on the topics from the source domain provides better target
domain document classification than an SVM trained on the word count feature vectors of the
source domain documents. In the next chapter, we conclude our dissertation and provide some
ideas for extending this research.

CHAPTER 8:

Conclusions  and  Future  Work 


In this dissertation, we address the problem of cross-domain document classification.
Cross-domain document classification is defined as the ability to classify unlabeled documents in
any domain while using a labeled set of documents from a different domain. We first establish
that conventional classification algorithms such as Support Vector Machines (SVMs) and
K-nearest neighbors (KNN) do not provide accurate classification when the training and testing
sets of documents come from two different domains of text. We then develop a new framework
for cross-domain document classification based on topic models such as the LDA model. We
develop and describe two different ways of using LDA for classification: (1) using the vectors
describing the distribution of topics in documents (γ vectors) and (2) using the topics as
distributions over words (β vectors). We derive several formulations and distance metrics
for our methods and give empirical results for each of our formulations. We develop new
classifiers and give theoretical and empirical justification for the effectiveness of our classifiers
for cross-domain classification.

8.1  Main  Contributions 

We perform experiments with 3 different domains, with each domain serving as both a target
and a source domain. This combination of domains gives us 6 different pairs of domains for our
experiments. Furthermore, we use different groups of categories from each of the domains,
which gives us a wide variety of data to show the consistency and robustness of our algorithms
as well as of our other experimental results. This thesis has four main contributions, which we
list below:

1.  Establishing  empirical  evidence  for  the  drop  in  accuracy  using  conventional  classification 
algorithms  when  different  domains  of  text  documents  are  used  for  testing  and  training  set. 

2.  Developing datasets using three different domains, namely (1) Wikipedia, (2) New York
Times (NYT) and (3) 20-Newsgroups. We document a method to gather a large
number of documents from similar categories from these different domains. These datasets
can be used for other similar research.

3.  Developing specialized algorithms for Wikipedia as the chosen source domain. Since
Wikipedia offers a vast amount of text data, it is important to analyze and explore ways
of exploiting the information provided by Wikipedia for text classification. We use the
unique structure of Wikipedia articles and develop algorithms based on the articles parsed
into their different sections. We show that using different sections of the Wikipedia articles
provides better accuracy, with less data, than using the entire Wikipedia article.

4.  Analyzing different ways of utilizing topic models for classification and developing
four different classification algorithms that consistently outperform the conventional
classifiers for cross-domain document classification. We derive a distance metric for each
of our four classifiers based on different assumptions about the relationship between the
source domain topics and target domain documents.

8.2  Future  Work 

Our research in cross-domain document classification provides classifiers, based on
topic models, that consistently show a large improvement in classification accuracy over
conventional algorithms. In this section we briefly discuss some ideas that can be pursued
to extend the work on cross-domain classification based on topic models. We use the LDA
model as our main topic model. With the LDA model, we obtain topics from the categories
of the source domain. These topics show a fair amount of separation among them, and we use
these topics to classify the documents in the target domain. In the future, we would like to
devise ways to enhance this separation. We can develop metrics that measure which topics
show a large overlap with topics from other categories and eliminate them from the classification
training set. Another direction of future work may be smoothing of the word count data and the
distributions obtained from them. Smoothing is often applied to Naive Bayes (NB) classifiers, as
NB classification is also sensitive to zero counts in the training data [77]. In our formulations,
the distance metrics often involve KL distances that contain the digamma function or the
log function, both of which diverge near zero and thus can be sensitive to small or zero
counts. Some methods for smoothing text data for the Dirichlet distribution are the subject of Dr.
Nallapati's doctoral thesis [78]. While extracting the topics from the source domain, we can
also use the unlabeled target domain as a source of prior information. Since topic models use
an iterative EM algorithm to find topics, having a relevant prior or starting point that is influenced
by the target domain may provide topics that are closer to the target domain and thus improve
the classification. In our framework, we do not make any assumptions about the underlying
domains, and the metrics we derive in our research can easily be adjusted to other topic models.
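
To make the zero-count sensitivity mentioned above concrete, the following is a minimal sketch under stated assumptions, not one of the formulations derived in this dissertation. It uses the plain discrete KL divergence (no digamma terms) and simple additive smoothing. The function names, the toy counts and the smoothing constant `alpha` are all hypothetical.

```python
import math


def smooth(counts, alpha=0.01):
    """Turn raw word counts into a distribution using additive smoothing.

    alpha is a hypothetical pseudo-count added to every vocabulary entry.
    """
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def kl_divergence(p, q):
    """Discrete KL(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy counts over a 4-word vocabulary. The third word occurs in the document
# but never in the topic, so an unsmoothed KL would divide by zero.
doc_counts = [3, 0, 1, 0]
topic_counts = [5, 2, 0, 1]

p = smooth(doc_counts)
q = smooth(topic_counts)
print(kl_divergence(p, q))
```

With smoothing, every entry of `q` is strictly positive, so the divergence stays finite. The choice of `alpha` trades robustness against distortion of the observed counts.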

APPENDIX  A:

Wilcoxon  Statistical  Test  Results 


The following tables show the results of the Wilcoxon rank sum test, which assesses the statistical
significance of the improvements obtained by our methods. The Wilcoxon rank sum test was used
because it does not assume a parametric distribution for the two samples [74]. The null hypothesis
is that the accuracy samples obtained by the two methods, (1) the conventional method and (2) our
LDA-based method, belong to the same distribution. The results show that in most cases we can
reject this null hypothesis at the 99% confidence level. Each test compares the accuracy
samples of the two classifiers. Each sample consists of a string of 1's and 0's signifying whether
a particular document was correctly classified or not.

The tables are laid out in the same order as the figures in Chapter 7. The p-value listed in each
table is the probability, under the null hypothesis, of observing a difference at least as
extreme as the one measured.
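
The test described above can be sketched as follows. This is an illustrative pure-Python version, not the code used to produce the tables: it assumes a two-sided test with the normal approximation and a tie correction (needed because 0/1 samples are heavily tied). The function name and the sample data are hypothetical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns (z, p_value) for samples x and y.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this group of tied values
        mid_rank = (i + 1 + j) / 2.0   # average of the ranks i+1 .. j
        from_x = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_x += mid_rank * from_x
        tie_term += t ** 3 - t
        i = j

    mean = n1 * (n + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 0.0, 1.0  # every observation is identical
    z = (rank_sum_x - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical per-document outcomes: 1 = correctly classified, 0 = not.
baseline = [1] * 60 + [0] * 40
lda_based = [1] * 80 + [0] * 20

z, p = rank_sum_test(baseline, lda_based)
print(f"z = {z:.3f}, p = {p:.3e}, reject null at 99%: {p < 0.01}")
```

The sign of `z` indicates which sample tends to rank higher, so a rejected null hypothesis can correspond to either an improvement or a drop in accuracy.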

A.1  Comparing KNN-Cosine and KNN-LDA Using KL Distance

Tables A.1 to A.6 show the statistical test results for sub-figures (a) through (f) in Figure 7.10.


Domains: NYT to Wiki

Category    knncosine  LDAKL  p-value        Reject Null (99%)
Art6        77.03      82.51  1.143140E-017  1
Computers5  50.04      47.19  1.769157E-003  1
Science6    63.41      65.04  8.838840E-003  1
Social6     64.42      73.71  2.386436E-034  1
Politics6   74.38      81.42  7.675699E-018  1

Table A.1: P-values for cross domain NYT to Wiki. Between: knncosine and LDAKL

Domains: Wikipedia to NYT

Category    knncosine  LDAKL  p-value        Reject Null (99%)
Art6        71.03      84.18  2.353846E-113  1
Computers5  53.16      49.22  8.356035E-006  1
Science6    68.41      64.67  1.287715E-007  1
Social6     58.65      70.70  1.435695E-048  1
Politics6   58.90      62.92  3.460969E-008  1

Table A.2: P-values for cross domain Wikipedia to NYT. Between: knncosine and LDAKL


Domains: Newsgroups to Wikipedia

Category    knncosine  LDAKL  p-value        Reject Null (99%)
all12       64.53      73.81  3.043221E-047  1
rec4        78.76      83.02  1.309288E-003  1
sci4        81.97      85.39  3.532474E-006  1
politics3   66.88      71.55  2.409783E-004  1

Table A.3: P-values for cross domain Newsgroups to Wikipedia. Between: knncosine and LDAKL


Domains: Wikipedia to Newsgroups

Category    knncosine  LDAKL  p-value        Reject Null (99%)
all12       55.94      63.55  5.543781E-032  1
rec4        62.42      72.25  1.167039E-020  1
sci4        77.01      68.58  4.336410E-017  1
politics3   51.74      56.24  1.071875E-003  1

Table A.4: P-values for cross domain Wikipedia to Newsgroups. Between: knncosine and LDAKL


Domains: Newsgroups to NYT

Category    knncosine  LDAKL  p-value        Reject Null (99%)
all12       57.92      70.90  2.871597E-143  1
rec4        85.41      89.84  1.505948E-013  1
sci4        73.44      81.29  2.925052E-025  1
politics3   68.48      70.67  3.060532E-002  1

Table A.5: P-values for cross domain Newsgroups to NYT. Between: knncosine and LDAKL

Domains: NYT to Newsgroups

Category    knncosine  LDAKL  p-value        Reject Null (99%)
all12       58.89      62.73  2.430838E-009  1
rec4        73.54      78.21  1.184613E-006  1
sci4        65.63      65.30  7.580971E-001  0
politics3   65.41      55.21  4.713171E-014  1

Table A.6: P-values for cross domain NYT to Newsgroups. Between: knncosine and LDAKL

Domains: NYT to Wiki

Category    LDAknncosine  LDAKL  p-value        Reject Null (99%)
Art6        82.23         82.51  6.451224E-001  0
Computers5  44.76         47.19  7.843933E-003  1
Science6    64.40         65.04  3.035260E-001  0
Social6     72.41         73.71  7.518420E-002  0
Politics6   80.64         81.42  3.143091E-001  0

Table A.7: P-values for cross domain NYT to Wiki. Between: LDAknncosine and LDAKL


Domains: Wikipedia to NYT

Category    LDAknncosine  LDAKL  p-value        Reject Null (99%)
Art6        82.91         84.18  1.376694E-002  1
Computers5  50.64         49.22  1.077030E-001  0
Science6    56.75         64.67  3.344569E-027  1
Social6     67.43         70.70  3.928332E-005  1
Politics6   61.75         62.92  1.051785E-001  0

Table A.8: P-values for cross domain Wikipedia to NYT. Between: LDAknncosine and LDAKL


Domains: Newsgroups to Wikipedia

Category    LDAknncosine  LDAKL  p-value        Reject Null (99%)
all12       76.78         73.81  7.287502E-007  1
rec4        82.85         83.02  8.931426E-001  0
sci4        84.89         85.39  4.837795E-001  0
politics3   79.57         71.55  1.327821E-011  1

Table A.9: P-values for cross domain Newsgroups to Wikipedia. Between: LDAknncosine and LDAKL


A.2  Comparing KNN-LDA Using Cosine and KNN-LDA Using KL Distance

Tables A.7 to A.12 show the statistical test results for sub-figures (a) through (f) in Figure 7.11.
These results compare the KL distance with the cosine distance measure, defined
as (1 - cosine similarity), when both are used with the LDA topics. We see that in general the
difference between the two measures is not statistically significant; however, the KL distance
does provide slightly better accuracy.
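
The two measures compared in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the classifiers evaluated in the tables: it applies a plain 1-nearest-neighbor rule to hypothetical topic-proportion vectors, and the labels and numbers are made up for the example.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def kl_distance(p, q):
    """Discrete KL(p || q); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def nearest(query, candidates, distance):
    """Return the label of the candidate closest to query under distance."""
    return min(candidates, key=lambda label: distance(query, candidates[label]))


# Hypothetical topic-proportion vectors for two source-domain categories
# and one target-domain document.
source_vectors = {"art": [0.7, 0.2, 0.1], "science": [0.1, 0.3, 0.6]}
target_doc = [0.6, 0.3, 0.1]

print(nearest(target_doc, source_vectors, cosine_distance))
print(nearest(target_doc, source_vectors, kl_distance))
```

Unlike the cosine distance, the KL distance is asymmetric and unbounded, so the two measures can rank neighbors differently even though they agree on this toy input.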

Domains: Wikipedia to Newsgroups

Category    LDAknncosine  LDAKL  p-value        Reject Null (99%)
all12       62.45         63.55  8.304009E-002  0
rec4        72.32         72.25  9.399733E-001  0
sci4        72.39         68.58  2.118888E-004  1
politics3   56.89         56.24  6.356201E-001  0

Table A.10: P-values for cross domain Wikipedia to Newsgroups. Between: LDAknncosine and LDAKL


Domains: Newsgroups to NYT

Category    LDAknncosine  LDAKL  p-value        Reject Null (99%)
all12       69.89         70.90  3.701209E-002  1
rec4        88.73         89.84  4.845328E-002  1
sci4        77.93         81.29  3.852562E-006  1
politics3   68.16         70.67  1.347853E-002  1

Table A.11: P-values for cross domain Newsgroups to NYT. Between: LDAknncosine and LDAKL


Domains: NYT to Newsgroups

Category    LDAknncosine  LDAKL  p-value        Reject Null (99%)
all12       64.30         62.73  1.322907E-002  1
rec4        68.69         78.21  8.643282E-022  1
sci4        58.50         65.30  5.084705E-010  1
politics3   56.47         55.21  3.585560E-001  0

Table A.12: P-values for cross domain NYT to Newsgroups. Between: LDAknncosine and LDAKL

Domains: NYT to Wiki

Category    svm    LDAsvm  p-value        Reject Null (99%)
Art6        76.53  80.12   4.771552E-008  1
Computers5  48.44  48.99   5.462959E-001  0
Science6    66.80  64.91   2.144264E-003  1
Social6     66.72  74.15   3.733569E-023  1
Politics6   74.94  82.22   2.459978E-019  1

Table A.13: P-values for cross domain NYT to Wiki. Between: svm and LDAsvm


Domains: Wikipedia to NYT

Category    svm    LDAsvm  p-value        Reject Null (99%)
Art6        73.91  83.96   6.201274E-070  1
Computers5  48.45  54.56   4.683887E-012  1
Science6    61.38  68.22   1.450411E-021  1
Social6     57.29  72.91   6.188038E-081  1
Politics6   59.35  65.50   2.043576E-017  1

Table A.14: P-values for cross domain Wikipedia to NYT. Between: svm and LDAsvm


Domains: Newsgroups to Wikipedia

Category    svm    LDAsvm  p-value        Reject Null (99%)
all12       76.08  74.16   1.358309E-003  1
rec4        82.34  82.79   7.224166E-001  0
sci4        89.10  85.68   2.434933E-007  1
politics3   72.62  75.39   2.183317E-002  1

Table A.15: P-values for cross domain Newsgroups to Wikipedia. Between: svm and LDAsvm


A.3  Comparing SVM Using Counts Features and SVM Using LDA Topics

Tables A.13 to A.18 show the statistical test results for sub-figures (a) through (f) in Figure 7.18.

Domains: Wikipedia to Newsgroups

Category    svm    LDAsvm  p-value        Reject Null (99%)
all12       54.81  65.04   1.660464E-056  1
rec4        64.65  77.02   9.038501E-034  1
sci4        76.98  72.01   4.085371E-007  1
politics3   55.36  60.82   6.224672E-005  1

Table A.16: P-values for cross domain Wikipedia to Newsgroups. Between: svm and LDAsvm


Domains: Newsgroups to NYT

Category    svm    LDAsvm  p-value        Reject Null (99%)
all12       70.11  70.38   5.765521E-001  0
rec4        88.84  89.14   6.003025E-001  0
sci4        75.68  81.83   8.345365E-017  1
politics3   64.44  69.26   3.735993E-006  1

Table A.17: P-values for cross domain Newsgroups to NYT. Between: svm and LDAsvm


Domains: NYT to Newsgroups

Category    svm    LDAsvm  p-value        Reject Null (99%)
all12       54.88  63.33   7.173769E-039  1
rec4        72.25  82.65   1.623395E-028  1
sci4        63.45  66.70   2.490855E-003  1
politics3   62.54  61.97   6.690104E-001  0

Table A.18: P-values for cross domain NYT to Newsgroups. Between: svm and LDAsvm

Domains: NYT to Wiki

Category    Cosine counts  KL Mean Distance  p-value        Reject Null (99%)
Art6        77.03          80.67             2.318729E-008  1
Computers5  50.04          52.70             3.648596E-003  1
Science6    63.41          65.76             1.522082E-004  1
Social6     64.42          73.44             2.114631E-032  1
Politics6   74.38          82.63             2.458951E-024  1

Table A.19: P-values for cross domain NYT to Wiki. Between: Cosine counts and KL Mean Distance


Domains: Wikipedia to NYT

Category    Cosine counts  KL Mean Distance  p-value        Reject Null (99%)
Art6        71.03          84.75             3.291039E-124  1
Computers5  53.16          57.50             7.719083E-007  1
Science6    68.41          68.46             9.485027E-001  0
Social6     58.65          72.62             1.643607E-065  1
Politics6   58.90          65.40             2.906596E-019  1

Table A.20: P-values for cross domain Wikipedia to NYT. Between: Cosine counts and KL Mean Distance


Domains: Newsgroups to Wikipedia

Category    Cosine counts  KL Mean Distance  p-value        Reject Null (99%)
rec4        78.76          83.02             1.309288E-003  1
sci4        81.97          85.80             1.691840E-007  1
politics3   66.88          72.39             1.392973E-005  1

Table A.21: P-values for cross domain Newsgroups to Wikipedia. Between: Cosine counts and KL Mean Distance


A.4  Comparing KNN-Cosine and KNN-LDA KL Mean Distance

Tables A.19 to A.24 show the statistical test results for sub-figures (a) through (f) in Figure 7.14.

Domains: Wikipedia to Newsgroups

Category    Cosine counts  KL Mean Distance  p-value        Reject Null (99%)
rec4        62.42          74.14             3.952567E-029  1
sci4        77.01          76.75             7.893392E-001  0
politics3   51.74          62.85             4.375089E-016  1

Table A.22: P-values for cross domain Wikipedia to Newsgroups. Between: Cosine counts and KL Mean Distance


Domains: Newsgroups to NYT

Category    Cosine counts  KL Mean Distance  p-value        Reject Null (99%)
rec4        85.41          91.00             1.755831E-021  1
sci4        73.44          83.09             2.508768E-038  1
politics3   68.48          68.79             7.567972E-001  0

Table A.23: P-values for cross domain Newsgroups to NYT. Between: Cosine counts and KL Mean Distance


Domains: NYT to Newsgroups

Category    Cosine counts  KL Mean Distance  p-value        Reject Null (99%)
rec4        73.54          83.96             7.969488E-030  1
sci4        65.63          69.85             6.332788E-005  1
politics3   65.41          67.12             1.885315E-001  0

Table A.24: P-values for cross domain NYT to Newsgroups. Between: Cosine counts and KL Mean Distance

Domains: NYT to Wiki

Category    Cosine counts  KL Distance Sum  p-value        Reject Null (99%)
Art6        77.03          81.13            2.706618E-010  1
Computers5  50.04          51.36            1.487446E-001  0
Science6    63.41          64.56            6.485582E-002  0
Social6     64.42          73.69            2.996782E-034  1
Politics6   74.38          83.23            4.787461E-028  1

Table A.25: P-values for cross domain NYT to Wiki. Between: Cosine counts and KL Distance Sum


Domains: Wikipedia to NYT

Category    Cosine counts  KL Distance Sum  p-value        Reject Null (99%)
Art6        71.03          84.31            9.824783E-116  1
Computers5  53.16          54.92            4.506520E-002  0
Science6    68.41          66.54            7.828963E-003  1
Social6     58.65          72.71            2.312902E-066  1
Politics6   58.90          64.30            1.125382E-013  1

Table A.26: P-values for cross domain Wikipedia to NYT. Between: Cosine counts and KL Distance Sum


Domains: Newsgroups to Wikipedia

Category    Cosine counts  KL Distance Sum  p-value        Reject Null (99%)
rec4        78.76          83.19            8.133625E-004  1
sci4        81.97          85.94            5.650083E-008  1
politics3   66.88          71.63            1.894803E-004  1

Table A.27: P-values for cross domain Newsgroups to Wikipedia. Between: Cosine counts and KL Distance Sum


A.5  Comparing KNN-Cosine and KNN-LDA KL Sum Distance

Tables A.25 to A.30 show the statistical test results for sub-figures (a) through (f) in Figure 7.16.

Domains: Wikipedia to Newsgroups

Category    Cosine counts  KL Distance Sum  p-value        Reject Null (99%)
rec4        62.42          74.12            5.259800E-029  1
sci4        77.01          74.97            3.488337E-002  0
politics3   51.74          60.25            5.414242E-010  1

Table A.28: P-values for cross domain Wikipedia to Newsgroups. Between: Cosine counts and KL Distance Sum


Domains: Newsgroups to NYT

Category    Cosine counts  KL Distance Sum  p-value        Reject Null (99%)
rec4        85.41          91.15            1.128994E-022  1
sci4        73.44          82.93            5.288515E-037  1
politics3   68.48          70.65            3.252476E-002  0

Table A.29: P-values for cross domain Newsgroups to NYT. Between: Cosine counts and KL Distance Sum


Domains: NYT to Newsgroups

Category    Cosine counts  KL Distance Sum  p-value        Reject Null (99%)
rec4        73.54          80.51            1.689902E-013  1
sci4        65.63          69.19            7.664864E-004  1
politics3   65.41          64.03            2.979449E-001  0

Table A.30: P-values for cross domain NYT to Newsgroups. Between: Cosine counts and KL Distance Sum

APPENDIX  B:

Accuracy For Cross-Domain Classification Using Conventional Algorithms


Some  additional  graphs  for  the  cross  domain  accuracy  drop  are  shown  in  this  appendix. 


[Figure B.1 panels; x-axis: Categories (all12, rec4, sci4, politics3)]
(a) Wikipedia to Newsgroups
(b) News2News
(c) Newsgroups to Wikipedia
(d) Wiki2Wiki

Figure B.1: Classification accuracy drop between domains Wikipedia and Newsgroups using conventional classifiers

[Figure B.2 panels]
(a) Newsgroups to NYT
(b) NYT2NYT
(c) NYT to Newsgroups
(d) News2News

Figure B.2: Classification accuracy drop between domains New York Times and Newsgroups using conventional classifiers